Inside AWS Bedrock AgentCore: Runtime, Versioning, and Safe Rollouts

12/21/2025 · Pradeep

Tags: AI, Agents, AWS, Infrastructure, AgentCore

Part 2 of a series on building production-scale agent platforms

In Part 1, we walked through why hosting agents in production is a fundamentally different problem than hosting stateless microservices. We covered the six hard problems (state management, versioning, A/B testing, identity, observability, secure tool integration) and compared architectural philosophies across AWS, Google, and Microsoft. We used the example of Lumen Health, a behavioral health platform deploying an AI agent to help BCBAs generate client progress reports from months of ABA session data.

That article stayed at the architectural level. Now we go deeper.

This piece is about what actually happens when an agent runs inside AWS Bedrock AgentCore. How does the runtime execute agents? How do you package and deploy them? How do you run multiple versions simultaneously, shift traffic between them, and roll back when something goes wrong? And why is all of this harder for agents than it is for traditional services?

What the AgentCore Runtime Actually Is

AgentCore is not a framework for writing agents. It is a managed execution environment for running them. That distinction matters.

Frameworks like LangChain, LangGraph, CrewAI, or AutoGen help you define agent logic: the prompt structure, the tool-calling patterns, the orchestration flow. AgentCore sits underneath all of that. It provides the compute isolation, the state management, the networking, the credential injection, and the deployment machinery that your agent needs to run reliably at scale. Think of it as the Fargate equivalent for agent workloads. You bring the agent, AgentCore provides the operational plane.

The runtime is built around a few core concepts:

Agent artifacts are the deployable units. An artifact contains everything needed to run your agent: the code, the model configuration, the tool definitions, the system prompt, and the orchestration logic. Artifacts are immutable. Once you publish one, it does not change. This is intentional. Immutability is what makes rollbacks possible.

Agent runtimes are the managed execution environments where your artifacts actually run. Each runtime is an isolated compute boundary with its own memory, network policies, and IAM role. When a request comes in, AgentCore routes it to the appropriate runtime, provisions any required state, and manages the full lifecycle of the agent's execution, including multi-turn conversations that span multiple invocations.

Deployment configurations define how traffic flows to different artifact versions. This is where versioning, A/B testing, and canary rollouts live. A deployment configuration can split traffic across multiple artifact versions by percentage, route specific users to specific versions, or gradually shift traffic from one version to another.
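
As a mental model, the routing decision a deployment configuration makes can be sketched in a few lines: user-level rules win first, then a weighted random split decides. This is an illustration of the concept, not AgentCore's actual API; the rule and weight shapes here are assumptions.

```python
import random

def resolve_version(user_org: str, rules: dict, weights: dict, rng=random) -> str:
    """Pick an artifact version for a request: explicit routing rules take
    precedence, then the request falls through to the weighted split."""
    if user_org in rules:                      # e.g. {"org-x": "v2"}
        return rules[user_org]
    versions = list(weights)                   # e.g. {"v1": 90, "v2": 10}
    return rng.choices(versions, weights=[weights[v] for v in versions])[0]

# A 90/10 canary with one organization pinned to v2:
rules = {"org-x": "v2"}
weights = {"v1": 90, "v2": 10}
assert resolve_version("org-x", rules, weights) == "v2"
```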

How AgentCore Executes Agents

When a request hits an AgentCore-hosted agent, here is what actually happens:

First, the deployment configuration resolves which artifact version should handle this request. If you are running a 90/10 canary split, this is where the dice roll happens. If you have user-level routing rules (say, all users from Organization X get version 2), those rules evaluate here.

Second, AgentCore provisions the execution context. This includes loading any persisted conversation state from the session store, injecting the appropriate IAM credentials for tool access, and setting up the observability pipeline that will capture the full execution trace.

Third, your agent code runs. This is your LangChain chain, your LangGraph graph, your custom orchestration loop, whatever you built. From your code's perspective, it is running in a normal compute environment. It can call models via Bedrock, invoke tools, read and write state. AgentCore handles the underlying infrastructure: the model endpoint routing, the credential rotation, the retry logic for transient failures.

Fourth, when execution completes (or pauses waiting for user input in a multi-turn conversation), AgentCore persists the updated conversation state, flushes the execution trace to the observability backend, and returns the response.

The important thing here is that your agent code does not need to know about any of this operational machinery. It does not manage its own state persistence. It does not handle credential injection. It does not implement traffic routing. That separation of concerns is what makes the runtime model work. Your team writes the clinical logic. AgentCore runs it safely.

Packaging and Deploying Agent Artifacts

There are two primary packaging models for getting your agent into AgentCore.

S3-Based Artifacts

The simpler path. You package your agent code and configuration into a structured bundle and upload it to S3. The bundle includes your agent's entry point, its dependencies, the system prompt, tool definitions, and a manifest file that tells AgentCore how to wire everything together.

This model works well for agents built primarily in Python using frameworks like LangChain or LangGraph. You package a virtual environment (or use Lambda-style layers for shared dependencies), point AgentCore at the S3 URI, and it handles the rest. The workflow looks roughly like:

  1. Your CI pipeline runs tests and evaluations against the agent
  2. On success, it packages the artifact and uploads to a versioned S3 path
  3. It calls AgentCore's API to create a new artifact version pointing to that S3 location
  4. It updates the deployment configuration to route a percentage of traffic to the new version

The advantage is simplicity. The disadvantage is that you are constrained to the runtime environments AgentCore provides. If your agent has unusual system dependencies or needs a specific binary toolkit, you may hit limits.
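
To make step 2 of that workflow concrete, here is a minimal packaging sketch. The manifest layout, entry-point convention, and S3 key scheme are all illustrative assumptions, not AgentCore's documented bundle format; the point is the shape of the operation: bundle everything, content-address it, write it to a versioned path.

```python
import hashlib
import json
import zipfile
from pathlib import Path

def package_artifact(src_dir: Path, out_dir: Path, version: str) -> tuple[Path, str]:
    """Bundle agent code plus a manifest into a zip and compute a versioned S3 key.

    The manifest fields below are hypothetical, for illustration only.
    """
    manifest = {
        "entry_point": "agent.main:handler",   # hypothetical convention
        "version": version,
        "prompt": "prompts/system.txt",
        "tools": "tools/definitions.json",
    }
    bundle = out_dir / f"agent-{version}.zip"
    with zipfile.ZipFile(bundle, "w") as zf:
        for f in sorted(src_dir.rglob("*")):   # sorted for reproducible bundles
            if f.is_file():
                zf.write(f, f.relative_to(src_dir))
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    digest = hashlib.sha256(bundle.read_bytes()).hexdigest()[:12]
    s3_key = f"artifacts/progress-report-agent/{version}/{digest}.zip"
    return bundle, s3_key
```

Your CI job would then upload the bundle (for example with boto3's `upload_file`) and register the resulting S3 URI as a new artifact version. Including a content digest in the key means two builds of the "same" version can never silently overwrite each other.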

Container-Based Artifacts via ECR

The more flexible path. You build a Docker image that contains your agent, push it to ECR, and tell AgentCore to run it. This gives you full control over the runtime environment: you can install whatever system packages you need, use any language or framework, and bundle custom binaries.

This model is natural for teams that already have containerized CI/CD pipelines. Your Dockerfile defines the agent environment, your build pipeline produces an image, pushes to ECR, and triggers a deployment update. AgentCore pulls the image and runs it in its managed infrastructure, applying the same isolation, credential injection, and state management as the S3 model.

The tradeoff is that you own more of the environment configuration. AgentCore still handles orchestration, networking, and state, but the internals of the container are your responsibility. If your base image has a vulnerability, you need to patch it. If your dependency resolution breaks, you need to debug it.

For Lumen Health, the ABA platform from our earlier example, the container model makes more sense. Their progress report agent integrates with several internal services (session data APIs, assessment data stores, treatment plan databases) and depends on some custom data processing libraries that are specific to their clinical data pipeline. Packaging all of that in a container gives them full control over the environment while still benefiting from AgentCore's operational infrastructure.

CI/CD Integration Patterns

The deployment pipeline for an agent is more complex than for a typical service, because the "test" phase is fundamentally harder. With a REST API, your CI pipeline runs unit tests and integration tests that assert on deterministic outputs. With an agent, you need evaluation pipelines that assess non-deterministic behavior.

A practical CI/CD pipeline for an AgentCore-hosted agent looks like:

  1. Code and configuration changes are committed to the repository. This includes prompt changes, tool definition updates, orchestration logic changes, and model version pins.
  2. Unit tests validate individual components: tool implementations, data transformers, prompt template rendering.
  3. Evaluation suite runs the agent against a curated dataset of inputs and expected behaviors. Not exact string matches, but structural and semantic evaluations. Did the agent call the right tools? Did it produce clinically accurate summaries? Did it avoid hallucination on known tricky inputs? These evaluations typically use a combination of LLM-as-judge scoring and deterministic rule checks.
  4. Artifact packaging bundles everything and pushes to S3 or ECR.
  5. Deployment update tells AgentCore about the new artifact version and sets the initial traffic allocation (usually a small canary percentage).
  6. Automated monitoring watches the canary for quality regressions, error rate spikes, or latency degradation.
  7. Traffic promotion gradually increases the new version's traffic share if metrics hold.

Steps 5 through 7 are where AgentCore's deployment machinery earns its keep. You are not scripting traffic shifts in your own infrastructure. You are declaring the desired state and letting the platform manage the transition.
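
Step 3, the evaluation suite, is the least familiar piece for teams used to deterministic tests. A minimal harness looks something like the following, where the `agent` and `judge` callables and the dataset shape are assumptions standing in for your real implementations:

```python
def evaluate(agent, judge, dataset, threshold=0.8):
    """Run the agent over curated cases, combining deterministic rule checks
    (did it call the expected tools?) with an LLM-as-judge quality score.

    `agent(input)` is assumed to return {"output": str, "tool_calls": [names]};
    `judge(input, output)` is assumed to return a score in [0, 1].
    """
    results = []
    for case in dataset:
        run = agent(case["input"])
        tools_ok = set(case["expected_tools"]) <= set(run["tool_calls"])
        score = judge(case["input"], run["output"])
        results.append({"id": case["id"], "tools_ok": tools_ok, "score": score})
    passed = [r for r in results if r["tools_ok"] and r["score"] >= threshold]
    return len(passed) / len(results), results
```

The pass rate gates artifact packaging: below some bar, the pipeline stops before anything reaches S3 or ECR.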

Versioning: Why This Is Harder for Agents Than APIs

With a REST API, versioning is primarily about the schema. If the request and response shapes are compatible, you can deploy with confidence. The behavior is deterministic: same input, same output. You can diff the outputs of two versions against a test suite and know exactly what changed.

Agent versioning is harder in three specific ways.

First, the "version" is a composite of many dimensions. A single agent version encompasses the model version, the system prompt, the tool definitions, the orchestration logic, the retrieval configuration (if using RAG), and the guardrails. Changing any one of these changes the agent's behavior. You need a versioning strategy that captures all of these as a single atomic unit. AgentCore's artifact model does this by treating the entire bundle as immutable and versioned, but you still need discipline in your team's workflow to ensure that every meaningful change produces a new artifact version, not an in-place mutation.

Second, the behavior is non-deterministic. The same input to the same agent version can produce different outputs on consecutive runs. This means you cannot simply compare outputs between versions to assess whether a change is safe. You need statistical evaluation over many runs. If version 2 of your progress report agent produces hallucinated mastery claims on 0.3% of reports, you will not catch that by running five test cases. You need hundreds, evaluated by both automated scoring and clinical review.
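
The arithmetic behind "you need hundreds" is worth making explicit. For a failure that occurs with probability p per run, the number of independent runs needed to observe it even once with a given confidence follows from solving 1 − (1 − p)^n ≥ confidence:

```python
import math

def runs_to_observe(p: float, confidence: float = 0.95) -> int:
    """Minimum independent runs needed so a failure occurring with
    probability p is seen at least once with the given confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# A 0.3% hallucination rate needs on the order of a thousand evaluated
# reports before you can even expect to witness one failure:
print(runs_to_observe(0.003))  # → 998
```

And witnessing one failure is a far weaker requirement than estimating the rate or comparing two versions, which needs several times more.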

Third, behavior drift happens without code changes. If you pin the model version (and you should), this is less of a risk. But if your agent uses retrieval-augmented generation and the underlying document corpus changes (new payer guidelines, updated clinical protocols), the agent's behavior shifts even though no code was deployed. Your versioning strategy needs to account for this. One approach is to version the retrieval index alongside the agent artifact. Another is to run continuous evaluation against production traffic to detect drift.

Running Multiple Versions: A/B Testing and Canary Deployments

Let me make this concrete with Lumen Health again. Their progress report agent has been running in production for three months as version 1. The clinical team has identified two improvements they want to ship:

Version 2 updates the prompt to better handle edge cases around maintenance programs. Some BCBAs reported that v1 was inconsistent in how it described programs that had been mastered and moved to maintenance. The change is relatively low-risk, mostly prompt restructuring with the same underlying logic.

An experimental version (vNext) switches the underlying model from Claude 3.5 Sonnet to Claude 4 Opus, which the team believes will produce more nuanced clinical interpretations, particularly around behavior function analysis. This is a higher-risk change because it alters the core reasoning capability.

Here is how Lumen deploys this using AgentCore's deployment configuration:

Phase 1: Canary for v2. The deployment configuration routes 95% of traffic to v1 and 5% to v2. The evaluation pipeline monitors key metrics: clinical accuracy scores (computed by an LLM judge against gold-standard reports), hallucination rate (checked against actual session data), and BCBA satisfaction (measured by how often clinicians accept the report without edits versus requesting regeneration). After one week, if v2's metrics are equal to or better than v1's, they promote to 50/50, then to 100% over the following week.

Phase 2: A/B test for vNext. Once v2 is stable and promoted to the primary version, they set up an A/B test: 80% of traffic goes to v2, 20% goes to vNext. This is not a canary (where you are validating before full promotion). This is a genuine experiment. The clinical team wants to measure whether Claude 4 Opus actually produces better interpretations, or whether the improvements are marginal. They run this test for a full month to accumulate enough reports for statistical significance, with clinical reviewers scoring a sample of reports from each version on a rubric.

Phase 3: Decision. If vNext performs meaningfully better, it becomes v3 and gets promoted to 100%. If the difference is marginal, they stay on v2 and save the compute cost difference. If vNext regresses on any critical metric, they discard it.

Notice what AgentCore is doing for them here. Lumen's team is not building traffic splitting infrastructure. They are not implementing session-pinning logic to ensure a BCBA consistently hits the same version within a multi-turn report generation session. They are not building the observability pipeline that tags each report with the version that generated it. They are declaring "split traffic 80/20 between these two artifact versions" and letting the platform handle the routing, session affinity, and metric attribution.
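
Session affinity is worth a sketch, because it is the piece teams most often get wrong when they roll their own splitting. The standard technique (shown here as an illustration of the idea, not AgentCore's internals) is to hash the session identifier into a stable bucket, so every turn of a conversation resolves to the same version:

```python
import hashlib

def pin_version(session_id: str, splits: list[tuple[str, int]]) -> str:
    """Deterministically map a session to a version so every request in a
    multi-turn conversation hits the same artifact. Splits are (version, pct)
    pairs that must sum to 100."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for version, pct in splits:
        cumulative += pct
        if bucket < cumulative:
            return version
    raise ValueError("splits must sum to 100")

# The same BCBA session always resolves to the same version:
splits = [("v2", 80), ("vNext", 20)]
assert pin_version("session-123", splits) == pin_version("session-123", splits)
```

Random per-request routing would let a report-generation conversation straddle two versions mid-stream, which corrupts both the user experience and the experiment's metric attribution.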

Behavior Drift: The Risk Nobody Plans For

I want to spend a moment on behavior drift because I think it is the most underestimated production risk for agent systems.

With a traditional service, your behavior changes when you deploy new code. Between deployments, behavior is stable. You can point to a specific commit and say "this is what is running in production right now."

Agents do not work this way. Even with a pinned model version and an immutable artifact, the agent's behavior can shift if any of its external dependencies change. For Lumen's progress report agent, consider what happens when:

  • A payer updates their documentation requirements. The agent's RAG corpus now contains the new guidelines, and the agent starts structuring reports differently even though no agent code changed.
  • A new RBT joins a client's team and uses a slightly different data entry pattern for trial-by-trial data. The agent's interpretation of session data shifts because the input distribution changed.
  • The session data API team deploys an update that changes how null values are returned for canceled sessions. The agent starts including canceled sessions in its trend analysis.

None of these involve a new agent deployment. None of them would be caught by pre-deployment evaluation. All of them can affect the quality of clinical documents.

The defense is continuous evaluation. You need a pipeline that runs against live production outputs (or a representative sample) and checks for quality regressions. In Lumen's case, this means sampling a percentage of generated reports each week, scoring them against clinical quality rubrics, and comparing scores to historical baselines. If scores drop below a threshold, the pipeline alerts the team and optionally triggers a rollback to a prior artifact version while the team investigates.
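
The core of that pipeline is a simple comparison against a rolling baseline. A sketch, with illustrative window and tolerance values (Lumen's real thresholds would come from their clinical rubric):

```python
from statistics import mean

def drift_alert(weekly_scores: list[float], window: int = 4,
                tolerance: float = 0.05) -> bool:
    """Compare the latest week's mean quality score against a rolling baseline
    of the previous `window` weeks; alert when the drop exceeds `tolerance`."""
    if len(weekly_scores) <= window:
        return False  # not enough history to establish a baseline
    baseline = mean(weekly_scores[-window - 1:-1])
    return baseline - weekly_scores[-1] > tolerance
```

The alert is what distinguishes drift from deployment regressions: it fires on weeks when nothing was deployed, which is exactly the signal that the environment, not the agent, changed.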

AgentCore's observability integration makes this feasible. Every agent execution produces a structured trace that includes the full input context, tool call sequence, retrieval results, and output. You can build evaluation pipelines that consume these traces, score the outputs, and feed the results into your monitoring dashboard alongside the usual operational metrics.

Rollbacks: The Safety Net You Will Use

Rollbacks for agents are conceptually similar to rollbacks for services, but operationally trickier. With a service, you roll back to a prior container image and traffic immediately starts hitting the old code. The old code is deterministic, so you know exactly what behavior you are restoring.

With an agent, rolling back to a prior artifact version restores the model configuration, prompt, tools, and orchestration logic. But it does not restore the state of the world. If the rollback was triggered because a retrieval corpus change caused quality issues, rolling back the agent artifact will not fix the corpus. You need to think about rollbacks as restoring the agent's "decision-making logic" to a known-good state, while recognizing that the environment the agent operates in may have changed since that version was last active.

AgentCore supports instant rollback by shifting 100% of traffic to a prior artifact version. Because artifacts are immutable, the prior version is always available and can be activated immediately. There is no "rebuild and redeploy" step. The platform keeps prior versions warm (within configured retention limits) so that rollback latency is measured in seconds, not minutes.

For Lumen, the rollback workflow is: the evaluation pipeline detects a quality regression, it pages the on-call engineer, the engineer reviews the evaluation results and the recent execution traces, and if the regression is confirmed, they update the deployment configuration to route all traffic to the last known-good artifact version. Total time from detection to mitigation: minutes, not hours.
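
Because artifacts are immutable, that final mitigation step reduces to a pure function over deployment history: find the most recent version marked known-good and route everything to it. The history record shape here is an assumption for illustration:

```python
def rollback_config(history: list[dict]) -> dict:
    """Given deployment history entries like {"version": "v3", "known_good": bool},
    build a traffic config routing 100% to the most recent known-good version."""
    for entry in reversed(history):
        if entry["known_good"]:
            return {"routes": [{"version": entry["version"], "weight": 100}]}
    raise RuntimeError("no known-good version to roll back to")
```

Note what makes this safe to run under pressure: it reads from an append-only history and produces a declarative config. There is nothing to rebuild and no mutable state to reason about at 2 a.m.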

What This Looks Like in Practice

Let me sketch the concrete infrastructure for Lumen's setup:

Repository structure:

  • Agent code (Python, LangGraph-based orchestration)
  • System prompt templates (versioned alongside code)
  • Tool definitions (JSON schemas for each tool the agent can call)
  • Evaluation datasets (curated input/expected-output pairs, updated monthly)
  • Evaluation scoring logic (LLM-judge prompts and deterministic checks)
  • Deployment configuration (AgentCore artifact and traffic split definitions)

CI/CD flow (CodePipeline + CodeBuild):

  • On merge to main: run unit tests, run evaluation suite against the dev dataset
  • On evaluation pass: build container image, push to ECR, create new AgentCore artifact version
  • On artifact creation: update deployment config to route 5% canary traffic to new version
  • Automated promotion: if canary metrics hold for 48 hours, promote to 50%, then 100% over the next 48 hours
  • Automated rollback: if canary metrics degrade beyond threshold, revert to prior version immediately
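
The promotion and rollback steps above hinge on a decision policy comparing canary metrics to the current primary's. A sketch, with illustrative thresholds (the metric names and limits are assumptions, not Lumen's actual values):

```python
def canary_decision(canary: dict, baseline: dict,
                    max_error_delta: float = 0.01,
                    max_p95_ratio: float = 1.2) -> str:
    """Return 'rollback', 'hold', or 'promote' by comparing a canary's metrics
    to the primary version's. Operational regressions trigger rollback;
    a quality shortfall merely holds the canary at its current share."""
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        return "rollback"
    if canary["p95_latency"] > baseline["p95_latency"] * max_p95_ratio:
        return "rollback"
    if canary["quality_score"] < baseline["quality_score"]:
        return "hold"  # not worse operationally, but quality hasn't matched yet
    return "promote"
```

Separating "hold" from "rollback" matters for agents: quality scores are noisy week to week, so a temporary dip should pause promotion rather than discard the version.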

Monitoring:

  • CloudWatch dashboards tracking: invocation count, latency (p50/p95/p99), error rate, tool call failure rate per version
  • X-Ray traces for every execution, searchable by version, organization, and outcome quality score
  • Weekly evaluation pipeline that scores a sample of production outputs against clinical quality rubrics
  • CloudTrail logs for all deployment configuration changes (who changed what, when)

This is not a weekend project. But it is also not six months of custom platform engineering. The hard parts (execution isolation, traffic routing, state management, credential injection, trace collection) are handled by AgentCore. Lumen's team focuses on the clinical logic, the evaluation criteria, and the deployment policies.

The Architectural Takeaway

AgentCore's value proposition is not that it makes agents easy. Agents in production are not easy, and any platform that claims otherwise is selling you something. The value is that it separates the concerns cleanly. Your team owns the agent logic and the quality criteria. The platform owns the execution environment and the deployment machinery.

This separation matters most during incidents. When a progress report comes out wrong and a BCBA flags it, you want to be debugging the clinical reasoning, not the infrastructure. You want to look at the execution trace and understand why the agent made the decision it made, not why the container crashed or the state was lost or the credentials expired. AgentCore does not prevent agent quality issues. It ensures that when they happen, the blast radius is contained and the debugging path is clear.

What Comes Next

We have covered the runtime model and the deployment machinery. We know how agents get packaged, versioned, deployed, tested, and rolled back inside AgentCore. But we have been treating the agent as a closed system that runs in isolation.

In practice, agents need to talk to the outside world. They need to call APIs, query databases, invoke other services, and sometimes talk to other agents. Each of those connections is a potential security boundary, a potential failure point, and a potential compliance concern. In a HIPAA-regulated environment like Lumen Health's, every outbound connection from the agent needs to be authenticated, authorized, audited, and rate-limited.

Now that we understand runtime and deployments, the next piece explores how AgentCore Gateway connects agents to the outside world safely.


This is Part 2 of a series on production-scale agent hosting. Part 1 covered the problem space and platform comparison. Part 3 will cover AgentCore Gateway and secure tool integration.