Observability and Evals: How Do You Know Your Agents Are Behaving Correctly?
Part 6 of a series on building production-scale agent platforms
This is the final article in the series. We have covered a lot of ground.
Part 1 laid out why hosting agents at production scale is fundamentally different from hosting microservices, and compared the architectural philosophies of AWS, Google, and Microsoft. Part 2 went deep into AgentCore's runtime, covering how agents get versioned, deployed, and rolled back. Part 3 covered Gateway, the mediation layer that controls every interaction between an agent and the outside world. Part 4 explored memory, the thing that gives agents continuity, personality, and the ability to build on prior interactions. Part 5 covered identity and policies, the governance layer that makes everything else safe to use.
All of that infrastructure exists to support one goal: running agents that are correct, reliable, and trustworthy. But none of it tells you whether you have achieved that goal. You can have perfect deployment pipelines, airtight security policies, and beautifully scoped memory systems, and your agent can still be confidently wrong in ways that damage your users and your business.
This article is about the hardest problem in production AI: knowing whether your agents are behaving correctly. Not whether they are running. Not whether they are fast. Whether they are right.
Why Agent Observability Is Different
With a traditional microservice, observability is well-understood. You monitor latency, error rates, throughput, and resource utilization. You set up alerts for anomalies. You use distributed tracing to follow requests across service boundaries. When something goes wrong, you look at the logs, identify the error, and fix the code. The failure modes are mechanical: a service crashed, a query timed out, a dependency returned an unexpected response.
Agent failure modes are different in kind, not just in degree.
An agent can return a 200 OK with a beautifully formatted response that is completely wrong. It can call the right tools in the right order and still produce an incorrect conclusion from the data it received. It can answer confidently when it should express uncertainty. It can be subtly biased in ways that only become visible over hundreds of interactions. It can degrade gradually as its input distribution shifts, never triggering a hard error but steadily producing lower-quality outputs.
Traditional observability catches mechanical failures. Agent observability must also catch cognitive failures: errors in reasoning, errors in judgment, errors in the quality of the output itself. And that requires a fundamentally different approach.
Traces vs Reasoning Traces
Distributed tracing (X-Ray, Jaeger, Zipkin) captures the mechanical flow of a request: which services were called, in what order, with what latency. This is necessary for agents but nowhere near sufficient.
A reasoning trace captures the cognitive flow: what the model was asked (the assembled prompt, including system instructions, retrieved memories, conversation history, and the user's input), what the model produced (including chain-of-thought reasoning where available), what tools the model decided to call and why, what data the tools returned, and how the model interpreted that data to produce its final output.
The difference is this: a distributed trace tells you that the agent called the session data API in 45ms and the assessment API in 62ms. A reasoning trace tells you that the agent retrieved three months of session data, noticed a downward trend in correct independent responses for the manding program, considered whether this constituted clinical regression, concluded that it did based on the 15% decline over two consecutive data collection periods, and recommended a program modification in the progress report.
When that recommendation is wrong (maybe the 15% decline was due to a new RBT who was still learning the data collection procedure, not actual client regression), you need the reasoning trace to understand why the agent reached the wrong conclusion. The distributed trace will not help you. The mechanical execution was flawless. The reasoning was flawed.
AgentCore captures both layers. The distributed trace flows through X-Ray, capturing timing, dependencies, and errors. The reasoning trace is captured as structured metadata within the X-Ray trace, including the full prompt assembly, model outputs, and tool call sequences. Both are searchable, and both link to the same correlation ID.
What Gets Logged
Let me be specific about what the observability pipeline captures for each agent execution.
Prompt assembly log. The complete prompt that was sent to the model, including: the system prompt (which varies by agent version), the retrieved memories (which memories were pulled and from which stores), the conversation history (including any summarization that occurred), and the user's current input. This is critical for debugging because the prompt is the single most important input to the model. If the prompt was assembled incorrectly (wrong memories, truncated history, missing context), the output will be wrong regardless of the model's capability.
Model inference log. The model's raw output, including any chain-of-thought or scratchpad content, the tool call decisions (which tools the model decided to call and with what parameters), and the final response. Token counts are logged here too: input tokens, output tokens, and the model version used. This is the basis for cost attribution.
Tool call logs. For each tool call: the tool name, the input parameters (with sensitive fields redacted per Gateway policy), the response (again with redaction), the latency, and any errors. These logs are produced by Gateway, which means they include the policy evaluation results (was the call allowed? was it rate-limited? was the input schema valid?).
Memory operation logs. Which memories were retrieved, which were stored, which were updated. Memory retrieval logs include the similarity scores, so you can see whether the retrieved memories were highly relevant or borderline.
Output log. The final response delivered to the user, along with any structured metadata (confidence scores, citation sources, flagged uncertainties).
All of this is expensive to store at full fidelity for every execution. In practice, most teams use a tiered approach: full fidelity logging for a configurable sample of executions (5-10%), summary logging for all executions, and triggered full logging for executions that match anomaly criteria (high latency, tool call errors, user-reported issues).
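The tiering decision itself can be a small function at the front of the logging pipeline. Here is a sketch, assuming hypothetical field names (`tool_errors`, `latency_ms`, `user_reported_issue`) on the execution record; real pipelines will have their own shapes:

```python
import random

FULL_SAMPLE_RATE = 0.05       # 5% of routine executions at full fidelity
LATENCY_ANOMALY_MS = 30_000   # illustrative anomaly threshold

def logging_tier(execution, sample_rate=FULL_SAMPLE_RATE, rng=random.random):
    """Return 'full' or 'summary' for one agent execution."""
    # Anomalies always get full-fidelity logs, regardless of sampling.
    if (execution.get("tool_errors", 0) > 0
            or execution.get("latency_ms", 0) > LATENCY_ANOMALY_MS
            or execution.get("user_reported_issue", False)):
        return "full"
    # Otherwise, sample a configurable fraction at full fidelity.
    return "full" if rng() < sample_rate else "summary"
```

The important property is that anomaly criteria override the sample rate: the executions you most need to debug are exactly the ones you cannot afford to have sampled away.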
Evaluation: The Core of Agent Quality
Observability tells you what happened. Evaluation tells you whether what happened was good. These are different disciplines, and conflating them is one of the most common mistakes teams make.
Evaluation for agents comes in two forms: offline (before deployment) and online (in production). Both are necessary. Neither is sufficient alone.
Offline Evals
Offline evaluation runs before an agent version reaches production traffic. It is the quality gate in your CI/CD pipeline, and it needs to answer: "is this version at least as good as the current production version?"
The challenge is constructing an evaluation dataset and scoring function that meaningfully predicts production quality. This is harder than it sounds.
Evaluation datasets should reflect the real distribution of production inputs, including edge cases, adversarial inputs, and the long tail of unusual requests. For Lumen Health, this means the evaluation set includes straightforward progress reports (high-performing clients, stable programs) but also difficult cases: clients with inconsistent session attendance, programs with ambiguous mastery criteria, assessment results that contradict the session data trends.
For Meridian Capital (our financial institution from Part 5), the evaluation set includes standard loan applications but also edge cases: self-employed applicants with irregular income documentation, applications near the approval threshold where the recommendation could go either way, applications with conflicting signals (strong income but high debt-to-income ratio).
Scoring functions must capture multiple quality dimensions. A single aggregate score is not enough, because an agent can improve on one dimension while regressing on another. For Lumen Health, the scoring dimensions include:
- Factual accuracy. Do the quantitative claims in the report (session counts, mastery percentages, trend descriptions) match the underlying data? This is checkable deterministically by comparing the agent's claims against the source data.
- Clinical soundness. Are the interpretations and recommendations clinically appropriate? This is harder to automate and typically uses an LLM-as-judge approach, where a separate model evaluates the agent's output against clinical criteria.
- Completeness. Does the report cover all required sections? Did the agent address all active programs, or did it miss some?
- Hallucination rate. Did the agent claim things that are not supported by the data? This is the most critical metric for clinical documentation.
- Format compliance. Does the report follow the payer's required format and include required language?
Each dimension gets its own score, and the promotion criteria specify minimum thresholds for each. A new version can only be promoted if it meets or exceeds the current version on all dimensions, not just on average.
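The per-dimension gate is worth making explicit, because "meets or exceeds on all dimensions" is a stricter rule than most teams implement by default. A sketch, with illustrative dimension names and thresholds (not Lumen Health's actual values):

```python
# Scores are assumed to be on a 0-1 scale, higher is better.
MIN_THRESHOLDS = {
    "factual_accuracy": 0.98,
    "clinical_soundness": 0.90,
    "completeness": 0.95,
    "hallucination_free": 0.98,   # 1 minus hallucination rate
    "format_compliance": 0.99,
}

def can_promote(candidate, production, thresholds=MIN_THRESHOLDS):
    """Promote only if the candidate meets every absolute floor AND
    does not regress on any dimension versus production."""
    for dim, floor in thresholds.items():
        if candidate[dim] < floor:
            return False, f"{dim} below threshold"
        if candidate[dim] < production[dim]:
            return False, f"{dim} regressed vs production"
    return True, "ok"
```

Note that there is no averaging anywhere: a version that gains 5 points of completeness while losing 1 point of factual accuracy fails the gate, by design.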
Running offline evals is itself computationally expensive. If your evaluation set has 500 cases and each case requires a full agent execution (model inference, tool calls, report generation), you are looking at significant model inference costs and wall-clock time. Parallelize aggressively. Use smaller, cheaper models for the LLM-as-judge scoring when the scoring criteria are well-defined. Cache tool call responses (since the evaluation dataset uses the same underlying data every time).
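Caching tool responses across eval cases can be as simple as memoizing on canonicalized parameters. A sketch, where the `fetch` callable stands in for the real Gateway call; this is safe only because the offline evaluation dataset is frozen, so responses are stable run to run:

```python
import json

class ToolCallCache:
    """Memoize tool responses across offline eval cases."""
    def __init__(self, fetch):
        self._fetch = fetch     # the real (expensive) tool call
        self._store = {}
        self.hits = 0

    def call(self, tool_name, params):
        # Canonical JSON so equivalent param dicts share one cache entry.
        key = (tool_name, json.dumps(params, sort_keys=True))
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._fetch(tool_name, params)
        return self._store[key]
```

Do not carry this cache into production, where tool data changes underneath you; it is an eval-time optimization only.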
Online Evals
Offline evals predict production quality. Online evals measure it. The gap between prediction and reality is always larger than you expect, because production traffic is messier, more diverse, and more adversarial than any evaluation dataset.
Online evaluation runs continuously against live production traffic. There are several approaches, and most teams end up using a combination.
Automated scoring on production outputs. Sample a percentage of agent executions and run the same scoring functions you use in offline evals. This catches gradual quality degradation that would not trigger any individual alert.
For Lumen Health, the automated pipeline samples 10% of generated progress reports and scores them on factual accuracy and hallucination rate. These scores are tracked as time series metrics in CloudWatch, with alerts if the 7-day rolling average drops below the threshold.
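The rolling-average check is simple enough to sketch. This version tracks a daily hallucination rate over a 7-day window and fires when the average crosses the 1.5% threshold; the figures mirror the Lumen Health example, but the shape is generic:

```python
from collections import deque

class RollingRateAlert:
    """7-day rolling average of a daily rate; fires when the average
    crosses the alerting threshold."""
    def __init__(self, threshold=0.015, window=7):
        self.threshold = threshold
        self.rates = deque(maxlen=window)  # oldest day falls off

    def record(self, daily_rate):
        """Append one day's rate; return True if the alert should fire."""
        self.rates.append(daily_rate)
        return sum(self.rates) / len(self.rates) > self.threshold
```

The rolling window is the point: a single bad day does not necessarily fire the alert, but a sustained shift does, which is exactly the gradual degradation that per-execution alerts miss.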
User feedback signals. When a BCBA accepts a generated report without edits, that is a positive signal. When they regenerate a section, that is a negative signal. When they manually rewrite a section, that is a strong negative signal. These implicit feedback signals, aggregated over time, provide a real-world quality metric that complements automated scoring.
The tricky part is that user feedback is noisy. A BCBA might edit a report for stylistic preferences, not because the content was wrong. Or they might accept a report that contains a subtle error because they are busy and did not review it carefully. Implicit feedback needs to be interpreted statistically, in aggregate, not treated as ground truth for individual cases.
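One way to make "interpreted statistically" concrete is to map feedback events to signed weights and only ever report the aggregate. The event names and weights below are illustrative, not a recommendation:

```python
FEEDBACK_WEIGHTS = {
    "accepted_no_edits": 1.0,     # positive signal
    "section_regenerated": -0.5,  # negative signal
    "section_rewritten": -1.0,    # strong negative signal
}

def weekly_feedback_score(events):
    """Aggregate noisy per-report signals into one weekly statistic.
    Meaningful only over many reports, never for a single case."""
    if not events:
        return None
    return sum(FEEDBACK_WEIGHTS[e] for e in events) / len(events)
```

Track this number as a time series next to your automated scores; divergence between the two (automated scores flat, feedback score falling) is itself a useful signal that your scorers are missing something users can see.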
Explicit feedback collection. For high-stakes outputs, ask the user directly. After generating a report, ask the BCBA: "Is this report clinically accurate and complete?" with options for "yes," "mostly but I made edits," and "no, there are significant issues." This is higher-quality signal but lower volume, since most users will skip the feedback step if it is not quick and unobtrusive.
Shadow Deployments
Shadow deployment is a technique where a new agent version processes production traffic but its outputs are not shown to users. Instead, the outputs are logged and scored alongside the current production version's outputs. This lets you compare two versions on real production traffic without any risk to users.
For Meridian Capital, shadow deployment is how they evaluate major agent changes. When testing a new risk scoring model integration, the shadow version processes every loan application that the production version handles. Both versions' recommendations are scored against the eventual human underwriter decision (the ground truth that arrives days or weeks later). After accumulating enough data, the team compares: did the shadow version agree with the human decision more often? Did it flag fewer false positives? Did it miss any cases that the human flagged?
The cost of shadow deployment is that you are paying for double the model inference. For Meridian, the cost is justified by the risk reduction. A bad recommendation from the loan agent has regulatory and financial consequences that far exceed the inference cost of a shadow run.
AgentCore supports shadow deployment as a deployment configuration option. You can route 100% of traffic to the production version for user-facing responses while simultaneously routing the same traffic to a shadow version for evaluation-only processing.
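The key invariant of shadow routing is that the shadow path can never affect the user-facing response, even when it fails. A minimal sketch of that dispatch pattern (not AgentCore's actual implementation):

```python
import concurrent.futures

def handle_request(request, production_agent, shadow_agent, log_shadow):
    """The user always gets the production output; the shadow runs on
    the same input for evaluation-only logging."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        shadow_future = pool.submit(shadow_agent, request)
        response = production_agent(request)       # user-facing path
        try:
            log_shadow(request, shadow_future.result(timeout=60))
        except Exception:
            pass   # shadow failures must never affect the user path
    return response
```

In a real system you would log shadow failures rather than swallowing them silently, and the shadow call would typically go through a queue rather than an in-process thread, but the isolation property is the same.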
Human-in-the-Loop Review
For the highest-stakes decisions, automated evaluation is not enough. You need human reviewers who understand the domain and can assess the agent's output with the judgment that automated metrics cannot capture.
Lumen Health runs a weekly clinical review where a senior BCBA reviews a random sample of 20 agent-generated reports. The reviewer scores each report on clinical accuracy, appropriate use of ABA terminology, quality of the clinical interpretation, and whether the recommended program modifications are appropriate. This is expensive (it takes the reviewer about 3 hours per week) but it catches failure modes that automated scoring misses, particularly around clinical nuance and judgment.
Meridian Capital runs a similar process with their compliance team. A compliance officer reviews a sample of the agent's underwriting recommendations each week, checking that the reasoning is consistent with regulatory requirements and internal risk policies.
The key to making human review sustainable is sampling strategy. You do not review every output. You stratify your sample: oversample edge cases (applications near the approval threshold, reports for clients with complex presentations), undersample routine cases, and always include any cases where the automated scoring flagged potential issues.
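The stratification logic is worth writing down explicitly, because ad hoc sampling tends to drift toward "whatever is easiest to pull." A sketch, assuming hypothetical `flagged` and `edge_case` metadata on each report record:

```python
import random

def stratified_review_sample(reports, n=20, rng=random):
    """Flagged cases always included; edge cases oversampled;
    routine cases fill the remainder."""
    flagged = [r for r in reports if r.get("flagged")]
    edge = [r for r in reports
            if r.get("edge_case") and not r.get("flagged")]
    routine = [r for r in reports
               if not r.get("edge_case") and not r.get("flagged")]

    sample = list(flagged)                     # always review flagged
    edge_n = min(len(edge), max(0, (n - len(sample)) * 2 // 3))
    sample += rng.sample(edge, edge_n)         # oversample edge cases
    rest = max(0, n - len(sample))
    sample += rng.sample(routine, min(len(routine), rest))
    return sample
```

The two-thirds edge-case allocation here is arbitrary; the right ratio depends on how skewed your production traffic is and how often edge cases actually surface problems.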
A Production Failure, End to End
Let me walk through a realistic failure scenario to show how all of this fits together.
Lumen Health deploys version 3.2 of the progress report agent. The offline evaluation passed on all dimensions. The canary deployment looks clean after 24 hours. Traffic is promoted to 100%.
Three days later, the online evaluation pipeline flags a problem. The hallucination rate has increased from 0.4% to 2.1%, well above the alerting threshold of 1.5%.
The on-call engineer starts investigating.
Step 1: Identify the scope. Using the CloudWatch dashboard, they filter by agent version (3.2), time range (the last 72 hours), and the hallucination flag. They find 47 flagged reports out of roughly 2,200 generated.
Step 2: Examine the traces. They pull the reasoning traces for a sample of the flagged reports. The traces show that the agent is correctly retrieving session data, but in the interpretation step, it is claiming mastery for programs where the data shows the client is still in acquisition phase. The agent is interpreting "80% correct across last 3 sessions" as mastery, when the actual mastery criterion for these programs is "80% correct across 5 consecutive sessions."
Step 3: Root cause. Comparing the v3.2 prompt to v3.1, the engineer finds the issue. A prompt refinement in v3.2 shortened the mastery criteria description from "80% correct across the number of consecutive sessions specified in the program's mastery criteria field" to "80% correct across recent sessions." The abbreviated language caused the model to default to its own interpretation of "recent" (3 sessions) rather than looking up the program-specific mastery criteria.
Step 4: Containment. The engineer triggers a rollback to v3.1 via the deployment configuration. AgentCore shifts 100% of traffic to v3.1 within seconds. The hallucination rate returns to 0.4% within the hour.
Step 5: Impact assessment. Using the data plane audit logs, the team identifies all 47 reports that were affected. They cross-reference with the user feedback signals and find that 31 of the reports were accepted by BCBAs without edits (meaning the error was not caught by the clinician). These 31 reports need manual review and potential correction.
Step 6: Remediation. The team fixes the prompt, adds specific test cases for mastery criteria interpretation to the offline evaluation suite, and deploys v3.3 through the full canary pipeline.
Total time from detection to containment: 45 minutes. Total time from detection to root cause: 2 hours. Total affected reports: 47, of which 31 require follow-up.
Without the observability pipeline, this failure would have been invisible. The agent was producing well-formatted, confident reports. No tool calls failed. No errors were thrown. The only signal was a subtle shift in the accuracy of clinical claims, caught by automated scoring that compared the agent's statements against the source data.
Cost Monitoring and Token Economics
Agent costs are dominated by model inference, and model inference costs are driven by token consumption. Understanding and managing token economics is an operational discipline that most teams underinvest in.
Input tokens are the most controllable cost lever. They are determined by the prompt assembly: system prompt length, retrieved memories, conversation history, and tool call results. Every memory you retrieve and every conversation turn you include adds input tokens to every subsequent model call. Over a multi-turn, multi-tool-call execution, the input token count compounds.
For Lumen Health, a typical progress report generation uses approximately:
- System prompt: 1,500 tokens
- Retrieved memories: 800 tokens
- Session data from tools: 3,000 to 8,000 tokens (depending on how many programs the client has)
- Conversation history: 500 to 2,000 tokens (depending on how many revision cycles)
- Assessment data: 1,200 tokens
Total input context: 7,000 to 13,500 tokens per model call, with 4 to 6 model calls per report generation. That is 28,000 to 81,000 input tokens per report.
Output tokens are less controllable (they depend on the model's response length) but more expensive per token. The progress report itself is typically 2,000 to 4,000 output tokens, plus shorter outputs for the intermediate reasoning and tool-call-decision steps.
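The figures above translate into a simple back-of-envelope cost model per report. The per-token prices below are placeholders, not any provider's actual rates:

```python
INPUT_PRICE_PER_1K = 0.003    # assumed USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.015   # assumed USD per 1K output tokens

def report_cost(input_tokens, output_tokens):
    """Estimated inference cost for one report generation."""
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)

# Worst case from the text: 81,000 input tokens across 4-6 model calls,
# plus roughly 4,000 output tokens for the report and intermediate steps.
```

Even with placeholder prices, the asymmetry is the lesson: input tokens dominate the count, so prompt assembly is where the leverage is, even though output tokens cost more each.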
Cost attribution needs to work at multiple levels: per-execution (for debugging and optimization), per-user (for quota enforcement), per-organization (for billing and capacity planning), and per-agent-version (for comparing the efficiency of different versions). AgentCore's token logging, combined with the deployment version tagging, enables all of these attribution dimensions.
Cost optimization strategies:
- Prompt compression. Regularly audit prompt templates and eliminate redundant instructions. A 20% reduction in system prompt length saves 20% on input tokens across every model call.
- Selective memory retrieval. Do not retrieve memories for every execution. Profile which memory types actually influence output quality, and skip the ones that do not. As we discussed in Part 4, not every retrieved memory contributes to the output.
- Tiered models. Use a cheaper, faster model for intermediate reasoning steps (tool selection, data interpretation) and a more capable model for the final output generation. Not every step in the agent's execution requires the most expensive model.
- Caching. For tool calls that return static or slowly-changing data (payer documentation requirements, regulatory rules), cache the results and reuse them across executions rather than re-fetching and re-injecting every time.
- Batch processing. For non-interactive workloads (like generating all monthly progress reports for an organization), batch execution allows token-level optimizations like sharing common context across multiple reports.
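The tiered-model strategy in particular often reduces to a small routing table. A sketch, where the step names and model identifiers are placeholders:

```python
STEP_MODEL = {
    "tool_selection": "small-fast-model",
    "data_interpretation": "small-fast-model",
    "final_report": "large-capable-model",
}

def model_for(step):
    """Route each execution step to a model tier. Unrecognized steps
    default to the capable model: fail expensive, not fail wrong."""
    return STEP_MODEL.get(step, "large-capable-model")
```

The default matters more than the table: when a new step appears that nobody has classified yet, you want it to cost more, not to silently get a weaker model.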
Alerting thresholds should cover both cost and quality:
- Per-execution cost exceeding 2x the historical median (indicates a possible reasoning loop or excessive tool calling)
- Per-organization daily cost exceeding budget allocation (indicates unexpected usage spike)
- Cost-per-quality-point ratio increasing (the agent is getting more expensive without getting better, suggesting optimization opportunities)
- Token utilization ratio (output tokens / input tokens) dropping below historical baseline (the agent is consuming more context but producing less useful output)
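The first two checks in that list are mechanical enough to sketch. The data shapes here are assumed; the thresholds mirror the text:

```python
import statistics

def cost_alerts(execution_cost, recent_costs, daily_org_cost, org_budget):
    """Evaluate the per-execution and per-organization cost alerts.
    recent_costs is a history of per-execution costs for the same agent."""
    alerts = []
    if recent_costs and execution_cost > 2 * statistics.median(recent_costs):
        alerts.append("execution_cost_over_2x_median")
    if daily_org_cost > org_budget:
        alerts.append("org_daily_budget_exceeded")
    return alerts
```

The median (not the mean) is deliberate: a few pathological executions with reasoning loops would drag a mean upward and mask the very anomalies you are trying to catch.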
Drift Detection: The Long Game
We touched on behavior drift in Part 2, but it deserves deeper treatment in the context of observability.
Drift is the gradual change in agent behavior that occurs without any deployment. It has several causes:
Input distribution shift. The types of requests the agent receives change over time. For Lumen Health, this might be seasonal: more reports in January (annual reviews), more complex cases during school transition periods, different data patterns when new RBTs join organizations in September.
Data drift. The data the agent accesses through its tools changes. New assessment instruments are adopted. Payers update their documentation requirements. The practice management system adds new data fields or changes the format of existing ones.
Memory accumulation. As the agent builds up more memories, its behavior shifts. A procedural memory learned from one BCBA's feedback might gradually influence recommendations for all BCBAs, subtly homogenizing the agent's output style.
Model drift. Even with pinned model versions, hosting providers occasionally update model weights within a version (bug fixes, safety patches). These changes are usually minor but can compound with other drift sources.
Detecting drift requires baseline metrics that are continuously compared against current performance. The key metrics to track:
- Output distribution. The statistical distribution of output characteristics (length, structure, vocabulary, recommendation types). A shift in this distribution, even if individual outputs look fine, may indicate drift.
- Tool call patterns. Changes in which tools are called, how often, and in what order. If the agent starts calling the assessment tool less frequently than its baseline, something has changed in its reasoning about when assessment data is relevant.
- Quality scores over time. The automated scoring metrics from online evaluation, tracked as time series with trend analysis. A slow downward trend over weeks is drift. A sudden drop is more likely a specific incident.
- User behavior signals. Changes in the rate of report acceptance, edit frequency, and regeneration requests. These are noisy but, averaged over weeks, can detect gradual degradation that automated scoring might miss.
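The distinction between a slow downward trend (drift) and a sudden drop (incident) can be operationalized with nothing fancier than a least-squares slope over the quality time series. The thresholds below are illustrative:

```python
def trend_slope(scores):
    """Least-squares slope of a quality time series (one point per period)."""
    n = len(scores)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den if den else 0.0

def classify_change(scores, drift_slope=-0.002, drop=0.05):
    """Rough separation of 'drift' (sustained trend) from 'incident'
    (sudden drop) from 'stable'."""
    if len(scores) >= 2 and scores[-2] - scores[-1] > drop:
        return "incident"
    if trend_slope(scores) < drift_slope:
        return "drift"
    return "stable"
```

In practice you would compute this over weekly aggregates with proper change-point detection, but even this crude version separates the two response playbooks: an incident pages someone; drift opens a review.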
When drift is detected, the response depends on the severity. Mild drift (quality scores still above threshold but trending down) triggers a review: is the drift acceptable, or does it warrant a prompt adjustment or evaluation set refresh? Severe drift (quality scores approaching or crossing thresholds) triggers the same incident response as a deployment failure: containment, investigation, remediation.
Looking Forward: Agent Platforms in the Next 3 to 5 Years
We are in the early days of production agent infrastructure. The platforms we have been discussing throughout this series represent the first generation of purpose-built agent hosting. They are real and usable today, but the landscape will look very different in a few years.
Agent-native infrastructure will become the default, not the exception. Today, most teams run agents on general-purpose compute (Kubernetes, Lambda) with custom orchestration. Within two to three years, managed agent runtimes will be as standard as managed container orchestrators are today. The operational patterns we have discussed (versioning, canary deployment, traffic splitting, memory management, identity propagation) will be built into the platform layer, not custom-built by each team.
Evaluation will become a first-class infrastructure concern. Right now, evaluation pipelines are mostly custom-built and loosely integrated with the deployment pipeline. Within the next generation, evaluation will be as integrated as testing is in traditional CI/CD. You will not be able to deploy an agent without passing an evaluation suite, just as you cannot deploy a service without passing tests. The evaluation tooling will mature to include standardized benchmarks, domain-specific scoring rubrics, and continuous monitoring that is as easy to set up as a CloudWatch alarm.
Multi-agent architectures will force new infrastructure patterns. We have focused on single agents throughout this series. But the industry is moving toward systems where multiple specialized agents collaborate. A clinical agent that generates a report might hand off to a billing agent that generates the insurance authorization, which coordinates with a scheduling agent that books the next assessment. These multi-agent workflows need orchestration infrastructure that is more sophisticated than what any current platform provides: shared state across agents, transactional semantics for multi-agent workflows, and observability that spans the entire agent collaboration.
Regulatory frameworks will catch up. Governments are beginning to regulate AI systems, and agents that take consequential actions (clinical recommendations, financial decisions, hiring assessments) will face increasing regulatory scrutiny. The governance infrastructure we discussed in Part 5 (audit trails, access reviews, incident response) will become mandatory, not optional. Platforms that bake compliance into the infrastructure layer will have a significant advantage over those that treat it as an afterthought.
The cost curve will shift. Model inference costs have been dropping steadily, and that trend will continue. But the operational costs of running agents (evaluation, monitoring, memory management, compliance) will become a larger share of the total cost. Teams that invest in operational efficiency early will have a structural advantage as agent deployments scale.
Closing Thoughts
We started this series with a simple observation: most teams are building AI agents, but very few are building them for production scale. Over six articles, we have tried to show what production scale actually means. Not just more compute. Not just higher availability. Production scale means the full stack of concerns that arise when an autonomous, non-deterministic system takes consequential actions in the real world.
For Lumen Health, "consequential" means clinical documentation that affects a child's treatment plan and a family's access to services. For Meridian Capital, it means lending decisions that affect people's financial lives. For any organization deploying agents in production, "consequential" means something specific to their domain, something that demands more than a demo-quality prototype and a hope that things work out.
The infrastructure we have covered across this series (runtime, deployments, Gateway, memory, identity, observability) exists to manage the gap between what agents can do and what agents should do. That gap is where the hard work lives. The models will keep getting better. The frameworks will keep getting easier. But the operational discipline of running agents responsibly, knowing what they are doing, knowing why, and knowing when they get it wrong, that discipline does not come from the model or the framework. It comes from the platform you build around them and the rigor you bring to operating it.
If there is one idea I hope persists from this series, it is this: the measure of a production AI system is not how impressive it is when it works. It is how gracefully it fails, how quickly you detect the failure, and how confidently you can explain what happened. Build for that, and the impressive parts will take care of themselves.
This concludes the six-part series on production-scale agent hosting. The series covered the problem space, AgentCore runtime and deployments, Gateway and tool integration, memory management, identity and policy enforcement, and observability and evaluation. Thank you for reading.