The Agent Hosting Problem Nobody Wants to Talk About

2/27/2026 · Pradeep

AI · Agents · Infrastructure · Cloud · ABA


Part 1 of a series on building production-scale agent platforms

Everybody is building agents. Your competitors are building agents. Your platform team just got a Slack channel called #ai-agents. Someone in your org has already shipped a prototype that calls GPT-4 in a loop with some tools, and leadership is asking when it goes to production.

Here is the uncomfortable truth: getting an agent to work on your laptop is maybe 5% of the problem. The other 95% is everything that happens after you type "deploy."

We have spent the last two decades getting very good at hosting stateless services. We have container orchestrators, load balancers, blue-green deployments, circuit breakers. We know how to run a REST API at scale. But agents are not REST APIs. They are not microservices. And the patterns we have internalized for running reliable distributed systems break down in surprising ways when the thing you are hosting can reason, remember, and take actions with real consequences.

This is the first article in a series where we dig into what it actually takes to host agents in production. Not toy demos. Not weekend hackathons. Production systems that handle real patient data, real clinical outcomes, real compliance requirements.

Agents Are Not Microservices (And That Matters More Than You Think)

A stateless microservice receives a request, does some computation, returns a response. You can kill it at any time. You can run 50 copies behind a load balancer. If one crashes, the next request goes to another instance. The contract is simple: same input, same output, no side effects between requests.

Agents violate nearly every assumption in that model.

An agent maintains conversational state across turns. It builds up context over the course of an interaction, and that context changes its behavior. It calls external tools, which means a single "request" might trigger a chain of database writes, API calls, and file operations before the agent decides it has enough information to respond. The execution path is non-deterministic. Two identical inputs can produce different tool-calling sequences depending on the model's reasoning at inference time.

This means you cannot just throw agents behind a load balancer and call it a day. You need sticky sessions or externalized state. You need to handle partial failures mid-chain (what happens when the agent called two tools successfully but the third one timed out?). You need to think about idempotency for tool calls that might get retried. You need to reason about what "rollback" even means when an agent has already written a clinical note or submitted an insurance authorization.
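The idempotency problem above can be made concrete with a small sketch. The idea is an idempotency-key ledger: before executing a tool call, derive a key from the session, step, tool name, and arguments; if that key has already completed, return the cached result instead of repeating the side effect. This is a minimal in-memory illustration, not any framework's API; a production version would back the ledger with a durable shared store.

```python
import hashlib
import json


class ToolCallLedger:
    """Records completed tool calls so retries do not repeat side effects.

    Minimal in-memory sketch; a real deployment would use a durable store
    (Redis, DynamoDB) shared across agent replicas.
    """

    def __init__(self):
        self._completed = {}  # idempotency key -> cached result

    def key(self, session_id: str, step: int, tool: str, args: dict) -> str:
        # Same session, step, tool, and arguments -> same key.
        payload = json.dumps([session_id, step, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, session_id, step, tool_name, args, tool_fn):
        k = self.key(session_id, step, tool_name, args)
        if k in self._completed:       # retried call: return the cached result
            return self._completed[k]  # without re-executing the side effect
        result = tool_fn(**args)
        self._completed[k] = result
        return result
```

A retry of step 3 in the same session hits the cache and the downstream write happens exactly once, which is the property you need when the tool is "submit an insurance authorization" rather than "fetch a record."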

If you have built event-driven systems or workflow orchestrators before, some of these problems will feel familiar. But the combination of non-deterministic control flow with real-world side effects creates a category of operational challenges that most teams are not prepared for.

The Six Hard Problems of Agent Hosting

Let me walk through the problems that hit you once you move past the prototype stage. These are not theoretical. They are what I see teams running into repeatedly, usually about three weeks after their first production deployment.

1. State and Memory Management

Every agent interaction involves at least two kinds of state: the short-term conversational context (what has happened in this session) and the long-term memory (what does this agent know about this user or this domain from prior interactions).

Short-term state is the easier problem. You can externalize it to Redis or DynamoDB. But you still need to handle session affinity, context window limits, and the question of what to do when a conversation exceeds the model's context length. Do you summarize? Do you truncate? Do you use a retrieval mechanism to pull in relevant prior turns? Each choice has tradeoffs for response quality and latency.
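One common middle ground between truncating and summarizing looks roughly like this sketch: keep the most recent turns under a token budget and collapse whatever overflows into a single summary turn. The `count_tokens` and `summarize` parameters are placeholders for whatever tokenizer and summarization call your stack provides.

```python
def trim_context(turns, max_tokens, count_tokens, summarize):
    """Keep the newest turns under a token budget; summarize the rest.

    `count_tokens` and `summarize` are stand-ins for your tokenizer and
    summarization call; the structure is what matters here.
    """
    kept, total = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        t = count_tokens(turn)
        if total + t > max_tokens:
            break
        kept.append(turn)
        total += t
    kept.reverse()
    overflow = turns[: len(turns) - len(kept)]
    if overflow:
        # Older turns are collapsed into one summary turn instead of dropped,
        # trading some fidelity for a bounded context size.
        kept.insert(0, summarize(overflow))
    return kept
```

The tradeoff the article mentions is visible in the code: summarization preserves more signal than truncation but adds a model call (latency and cost) on the hot path.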

Long-term memory is where things get genuinely hard. If your agent is helping a BCBA review client progress, it needs to remember that this learner has been working on manding targets for three months, that their latest assessment showed regression on two maintenance programs, and that the supervising clinician flagged a potential behavior function change last week. That memory needs to be queryable, updatable, and scoped to the right organization. It also needs to be purgeable for HIPAA compliance. You are essentially building a per-client clinical knowledge base that the agent can read and write during execution.

Most teams start by stuffing everything into the system prompt. That stops working around month two.

2. Versioning Agents

When you deploy a new version of a microservice, you have a clear contract: the API schema. As long as the new version honors the same request/response shapes, you can deploy with confidence.

What is the "contract" for an agent? It is the combination of the system prompt, the model version, the tool definitions, the retrieval configuration, the guardrails, and the orchestration logic. Change any one of those, and the agent's behavior shifts. Sometimes in subtle ways that do not show up until a specific edge case hits production.

You need a versioning strategy that captures all of these dimensions. Not just "v1.2 of the prompt" but a composite version that pins the model, the tools, the retrieval index, and the behavioral guardrails together as a single deployable unit. And you need the ability to roll back to a prior version quickly when something goes wrong.

This is not a solved problem in any framework I have seen. Most teams are versioning prompts in Git and hoping for the best.
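A minimal version of the composite-version idea might look like the following sketch: a frozen record of every behavior-shaping dimension, with a content hash serving as the deployable version identifier. The field names are illustrative, not from any particular framework.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentVersion:
    """One deployable unit pinning every dimension that shapes behavior."""
    system_prompt: str
    model_id: str         # a pinned model snapshot, never "latest"
    tool_schemas: tuple   # frozen tool definitions
    retrieval_index: str  # index name plus snapshot id
    guardrail_config: str

    @property
    def fingerprint(self) -> str:
        # Content-addressed version id: change any field, get a new version.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Because the fingerprint is derived from the content, "v1.2 of the prompt" with a silently updated model snapshot is a different version, which is exactly the property Git-only prompt versioning fails to give you.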

3. A/B Testing Agent Behaviors

Traditional A/B testing assumes you can measure a clear outcome metric and attribute it to a specific variant. With agents, the interaction surface is enormous. Did the agent perform better because of the prompt change, or because the clinician happened to ask simpler questions? The non-deterministic nature of LLM outputs means you need significantly larger sample sizes to reach statistical significance on behavioral changes.

You also need to decide what "better" means. Faster note generation? Higher clinician satisfaction scores? Fewer hallucinated skill targets? More accurate progress summaries? These metrics can conflict with each other. An agent that is more cautious will hallucinate less, but its clinical recommendations will also be less useful.

And there is the operational complexity: you need to route users consistently to the same variant across a multi-turn conversation. You need to log which variant produced which response. You need evaluation pipelines that can score agent outputs at scale, ideally with a mix of automated metrics and clinical review.
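Consistent routing is usually solved with deterministic bucketing: hash the user and experiment together so every turn of a multi-turn conversation lands on the same variant, with no assignment table to keep in sync. A hedged sketch:

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants, weights=None):
    """Deterministically map a user to a variant.

    The same (experiment, user) pair always yields the same variant, so a
    multi-turn conversation never flips configurations mid-session.
    `weights` are percentages; they default to an even split.
    """
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(h, 16) % 100  # stable bucket in [0, 100)
    weights = weights or [100 // len(variants)] * len(variants)
    cumulative = 0
    for variant, w in zip(variants, weights):
        cumulative += w
        if bucket < cumulative:
            return variant
    return variants[-1]
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in "treatment" for one test is not systematically in "treatment" for all of them.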

4. Identity, Tenancy, and Policy Control

In an ABA practice management platform, the agent helping a Registered Behavior Technician (RBT) document session data must operate under completely different access policies than the agent assisting a BCBA in writing a treatment plan or reviewing assessment results. Same platform, same agent codebase, different identity contexts. An RBT should never see billing data. A BCBA from Organization A should never see client records from Organization B.

This goes beyond API-key-level authentication. The agent needs to carry the calling user's identity through every tool invocation. If the agent queries a client's session history, that query must respect the user's role-based access and their organizational scope. If the agent generates a progress report, the audit log must attribute that action to the originating clinician, not to some service account.

Multi-tenancy adds another layer. In behavioral health, each organization may have different data residency requirements, different payer-specific documentation standards, and different feature configurations that affect which tools the agent can access. You need tenant isolation that is robust enough to survive prompt injection attempts, because a sufficiently creative user prompt could try to trick the agent into accessing cross-organization client data. In a HIPAA-regulated environment, that is not just a bug. It is a reportable breach.
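The "identity travels with every tool invocation" principle can be sketched as a wrapper that refuses to execute without an explicit caller context and forces the organizational scope into every downstream call. The names here are hypothetical, for illustration only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerContext:
    user_id: str
    role: str    # e.g. "RBT" or "BCBA"
    org_id: str


class ScopedTool:
    """Wraps a tool so every invocation is authorized against the caller,
    never against a shared service account."""

    def __init__(self, fn, allowed_roles):
        self.fn = fn
        self.allowed_roles = set(allowed_roles)

    def __call__(self, ctx: CallerContext, **kwargs):
        if ctx.role not in self.allowed_roles:
            raise PermissionError(
                f"role {ctx.role} may not call {self.fn.__name__}")
        # The org scope and caller identity travel with the call; the tool
        # implementation must filter its query by org_id and the audit log
        # attributes the action to caller, not to a service account.
        return self.fn(org_id=ctx.org_id, caller=ctx.user_id, **kwargs)
```

The key design choice is that the agent's planning loop never holds raw credentials; it only holds a caller context, and the tool layer decides what that context is allowed to touch. A prompt-injected request still arrives with the original user's scope attached.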

5. Observability and Evaluation

You cannot grep your way to understanding why an agent made a bad decision. Traditional application monitoring tells you about latency, error rates, and throughput. With agents, you also need to understand the reasoning chain: what did the agent "think," what tools did it consider calling, why did it choose the path it chose, and where did the reasoning go wrong?

This requires trace-level observability that captures the full interaction graph: the clinician's input, the model's chain-of-thought (where available), each tool call with its inputs and outputs, the retrieval results that informed the response, and the final output. You need to be able to search and filter these traces by outcome quality, not just by technical metrics.
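The shape of that interaction graph can be captured with nested spans, much like distributed tracing. This is a bare-bones sketch of the recording structure, not a substitute for an OpenTelemetry-style pipeline:

```python
import time
import uuid
from contextlib import contextmanager


class AgentTrace:
    """Captures one agent run as a list of timed spans.

    Each span records what kind of step it was (llm_call, tool_call,
    retrieval, ...) plus arbitrary attributes, so traces can later be
    filtered by outcome quality, not just latency.
    """

    def __init__(self, session_id):
        self.session_id = session_id
        self.spans = []

    @contextmanager
    def span(self, kind, **attrs):
        record = {"id": uuid.uuid4().hex, "kind": kind, "attrs": attrs,
                  "start": time.monotonic()}
        try:
            yield record  # the caller can attach outputs as they arrive
        finally:
            record["duration"] = time.monotonic() - record["start"]
            self.spans.append(record)
```

Wrapping every model call, tool call, and retrieval in a span like this is what makes "why did the agent choose this path" answerable after the fact, instead of requiring a live reproduction of a non-deterministic run.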

Evaluation is the other half of this. You need automated pipelines that score agent outputs against ground-truth datasets, and you need those pipelines to run continuously, not just during development. Model providers update their models. Retrieval indexes change as new clinical data arrives. The agent's behavior can drift without any code change on your side. When the agent is summarizing ABA session data or recommending program modifications, a subtle quality regression can directly affect patient care.

6. Secure Tool Integration

An agent's tools are its hands. A clinical session assistant agent might need to pull a learner's active programs from the practice management system, query historical session data to identify trends, look up assessment results (VB-MAPP, ABLLS-R, AFLS), retrieve the current treatment plan, and generate progress notes or parent-friendly summaries. Each of those integrations has its own authentication mechanism, rate limits, error modes, and data sensitivity classification.

You need a tool integration layer that handles credential management (rotating secrets, scoping permissions), input validation (preventing the agent from passing malformed or malicious parameters), output sanitization (ensuring tool responses do not leak PHI from one client into another client's conversation), and circuit breaking (stopping the agent from hammering a downstream service that is already degraded).
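Of those four responsibilities, circuit breaking is the most mechanical, and a minimal sketch shows the shape: count consecutive failures, open the circuit once a threshold is hit, and reject calls immediately until a cool-down passes. The thresholds here are illustrative.

```python
import time


class CircuitBreaker:
    """Stops the agent from hammering a degraded downstream service."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast: the agent gets an immediate error it can
                # surface or plan around, and the downstream gets relief.
                raise RuntimeError("circuit open: downstream unavailable")
            self.opened_at = None  # half-open: allow one probe through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

In an agent loop this matters more than in a request/response service, because a reasoning model that gets a timeout will often just retry the same tool call, turning one degraded dependency into a self-inflicted outage.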

Most prototype agents have tools implemented as Python functions with hardcoded API keys. That is a security incident waiting to happen. In healthcare, it is a HIPAA violation waiting to happen.

The Cost of Building It Yourself

Every platform team's first instinct is to build their own agent hosting layer. I understand the impulse. You already have Kubernetes. You already have a CI/CD pipeline. How hard can it be to add a few containers that run agent loops?

The answer is that the easy part (running the inference calls) takes about a week. The hard part (everything listed above) takes six to twelve months and requires expertise in areas that most application teams do not have deep bench strength in: ML operations, security architecture for non-deterministic systems, evaluation pipeline design, and distributed state management for long-running conversational workloads.

I have watched three different teams attempt this in the last year. Two of them ended up with fragile internal platforms that required a dedicated team of four to six engineers just to keep running. The third abandoned their custom solution after a security audit revealed that their tool integration layer had no input validation, and an internal red team was able to use prompt injection to exfiltrate data from a staging environment.

The build-versus-buy calculus here is different from traditional infrastructure. With a database or a message queue, the problem is well-understood and the build option, while expensive, is bounded. With agent hosting, the problem space is still evolving. You are building on shifting ground.

Comparing the Major Agent Platforms

Three hyperscalers have now shipped production-grade agent hosting platforms. They have different architectural philosophies, different strengths, and different opinions about how agents should be built. Let me walk through each one, not as a feature checklist, but as an architectural assessment.

AWS Bedrock AgentCore

Amazon's approach with AgentCore reflects their infrastructure-first DNA. AgentCore is not a high-level agent builder. It is a runtime layer that provides the operational primitives you need to host agents at scale: managed execution environments, externalized memory and state, tool integration with IAM-scoped credentials, and a deployment model that supports versioned rollouts.

The architectural philosophy is "bring your own agent logic, we will handle the operational plane." This is similar to how Lambda works: you write the function, AWS handles the scaling, the execution environment, the observability integration, and the security boundary. AgentCore applies that same thinking to agent workloads.

Where this shines is in enterprises that already have deep AWS investments. The IAM integration is particularly strong. You can scope an agent's tool access using the same policy language you use for everything else in AWS, which means your existing security team can review and approve agent permissions using familiar patterns. The integration with CloudWatch, X-Ray, and CloudTrail means agent executions produce the same kind of audit trail that your compliance team already knows how to work with. For a HIPAA-regulated behavioral health platform, that operational maturity matters enormously.

The tradeoff is that AgentCore is lower-level than some teams want. If you are looking for a drag-and-drop agent builder, this is not it. You need to understand the runtime model, manage your own orchestration logic (or use the provided orchestration primitives), and integrate your own evaluation pipelines. The power is there, but the on-ramp is steeper.

Google Vertex AI Agent Builder

Google's approach comes from the opposite direction. Vertex AI Agent Builder is opinionated and vertically integrated. Google wants you to use their models, their retrieval stack (Vertex AI Search), their orchestration framework, and their evaluation tools. The result is a more cohesive developer experience if you are willing to buy into the full stack.

The architectural philosophy is "we will handle the hard parts of agent design so you can focus on your domain logic." This shows up in features like grounding (automatically connecting agent responses to authoritative data sources), built-in citation generation, and integrated evaluation that runs against Google's own quality benchmarks.

Where Google is particularly strong is in retrieval-augmented generation. Their search infrastructure is, unsurprisingly, very good, and the integration between Vertex AI Search and the agent runtime is tighter than what you get when wiring up a vector database to an agent framework manually. For use cases where the agent's primary job is to find and synthesize information from large document corpora (think: clinical practice guidelines, payer documentation requirements, published ABA research), this is a meaningful advantage.

The tradeoff is lock-in. Google's agent platform is deeply integrated with GCP services, and extracting yourself later is non-trivial. The platform is also more prescriptive about how agents should be structured, which can be limiting if your use case does not fit the patterns Google had in mind. Multi-cloud deployments are harder here than with the other two options.

Microsoft Azure AI Agent Service

Microsoft's approach reflects their enterprise distribution strength and the OpenAI partnership. Azure AI Agent Service is built around the concept that most enterprise agents will be powered by OpenAI models and need to integrate with the Microsoft productivity ecosystem (Office 365, Dynamics, SharePoint, Teams).

The architectural philosophy is "agents should be extensions of the tools your organization already uses." This is a pragmatic bet. Many ABA organizations run their administrative operations through Microsoft tools. If your clinicians coordinate through Teams, share parent training materials through SharePoint, and manage schedules through Outlook, Microsoft's integrations let agents surface directly in those workflows rather than requiring clinicians to context-switch to a separate interface.

Where Microsoft excels is in the breadth of the integration ecosystem. Azure AI Agent Service connects to the Microsoft Graph, which gives agents access to organizational data (calendars, contacts, documents, emails) with the user's existing permissions. For organizations where clinical documentation, scheduling, and team communication already run through Microsoft, this reduces the integration burden dramatically.

The tradeoff is that the platform's identity is still evolving. Microsoft has shipped multiple agent-related offerings (Copilot Studio, Azure AI Agent Service, Semantic Kernel, AutoGen) and the boundaries between them are not always clear. There is a consolidation happening, but as of today, choosing the right combination of Microsoft agent technologies requires more research than it should. The operational maturity of the runtime layer also lags behind what AWS offers with AgentCore, particularly around deployment versioning and rollback capabilities.

Putting It Together: A Realistic Example

Let me make this concrete. Imagine a behavioral health technology company, call them Lumen Health, that builds a practice management platform used by 200 ABA therapy organizations. They want to deploy an AI agent that assists BCBAs in generating client progress reports. Today, a BCBA spends 45 minutes to an hour per client per month reviewing session data, identifying trends, summarizing mastered and in-progress targets, and writing a narrative progress report that goes to parents and funding sources. Lumen wants to cut that to 10 minutes of review and approval.

The agent needs to:

  • Pull the learner's active treatment plan and current program list from the practice management system
  • Query three months of session data across all RBTs who worked with the client, including trial-by-trial data and behavior frequency counts
  • Retrieve the most recent assessment scores (VB-MAPP milestones, barriers, transitions) and compare them to the prior assessment period
  • Identify programs where the client has met mastery criteria, programs showing steady acquisition, and programs where data suggests the client is stuck or regressing
  • Generate a structured progress report that includes quantitative summaries, clinical interpretations, and recommended program modifications
  • Flag anything that looks clinically significant for the BCBA to review before the report is finalized

This is not a chatbot. It is an autonomous clinical workflow participant that synthesizes months of behavioral data into a document that funding sources use to make continued authorization decisions. If the agent hallucinates a mastered target that was never actually mastered, that is a clinical documentation integrity problem. If it misses a regression pattern, a child might continue on an ineffective program for another quarter.

With a DIY approach, Lumen's platform team would need to build: a stateful execution runtime (because report generation involves iterative clinician review over multiple sessions), a secure integration layer for their own clinical data APIs with per-organization credential scoping, an authorization layer that ensures the agent can only access data for clients assigned to the requesting BCBA within their organization, a HIPAA-compliant audit trail that logs every data access and generation event, a versioning system that lets them update the agent's clinical reasoning (maybe the latest ABA literature suggests different mastery criteria thresholds) without risking regressions in report quality, and an evaluation pipeline that continuously measures the agent's output against BCBA-reviewed gold-standard reports.

That is at minimum six months of focused platform engineering, and the resulting system would be Lumen's responsibility to secure, scale, and maintain across 200 organizations with different payer requirements and documentation standards.

With a managed platform, Lumen's team focuses on the domain logic: the clinical prompts, the tool definitions for their data APIs, the business rules for mastery criteria and regression detection, and the evaluation criteria that define what a good progress report looks like.

On AWS, this would look like an AgentCore-hosted agent with tool integrations scoped via IAM roles per organization, session state in DynamoDB, traces flowing to X-Ray for debugging cases where the agent misinterprets session data, and deployment versions managed through AgentCore's runtime. When a payer changes their documentation requirements, Lumen can deploy an updated agent version to the affected organizations without touching the rest. The HIPAA security review can leverage existing AWS BAA coverage and CloudTrail audit logs.

On Google, the agent would use Vertex AI Agent Builder with Vertex AI Search grounding the clinical interpretation against published ABA research and payer-specific documentation guidelines. The evaluation pipeline would use Vertex AI's built-in quality metrics for factual accuracy, supplemented with custom metrics comparing agent-generated reports against BCBA-authored reports.

On Azure, the agent would integrate naturally if Lumen's client organizations use Teams for clinical team coordination, surfacing draft reports and flagged items directly in the BCBA's daily workflow. The data integrations would use Azure API Management with managed identities and the healthcare-specific compliance controls Azure offers.

None of these is a clear winner in all dimensions. The right choice depends on Lumen's existing cloud footprint, their team's familiarity with each platform, the specific HIPAA and payer compliance requirements they face, and how much control they need over the clinical reasoning layer.

Where This Series Goes Next

This article covered the problem space: why agent hosting is hard, what the specific challenges are, and how the major platforms approach them differently. But we stayed at the architectural level. The real questions start when you zoom in.

How does an agent runtime actually manage versions and deployments? What does a safe rollout look like when the thing you are deploying can reason? How do you implement canary deployments for non-deterministic systems where the same input does not produce the same output? And how do you handle the fact that in healthcare, a bad rollout does not just mean degraded latency. It means potentially incorrect clinical documentation reaching a child's care team.

In the next article, we will go deep into AWS Bedrock AgentCore's runtime model and how it handles versions, deployments, and safe rollouts. We will look at the actual mechanics: how agent versions are pinned, how traffic shifting works, how rollback is triggered, and what the operational workflow looks like for a team shipping agent updates weekly. If you are evaluating AgentCore or just trying to understand what good agent deployment infrastructure looks like, that piece will give you the concrete details you need.


This is Part 1 of a series on production-scale agent hosting. The series is written for engineering teams who are past the prototype stage and need to make real infrastructure decisions about how to run agents reliably, especially in regulated domains like behavioral health where the stakes extend beyond uptime metrics.