Everyone has an agent demo.
You've seen the video. Someone types a natural language request, the agent calls a few tools, chains together some reasoning steps, and produces a result that looks like magic. The audience claps. The Twitter thread goes viral. The VC writes a check.
Then someone tries to run it on real data, with real users, at 3 AM on a Saturday when the API is rate-limited and the upstream service is returning HTML instead of JSON — and the magic evaporates. The agent hallucinates a tool that doesn't exist, retries the same failing call fourteen times, burns through $200 in API costs, and returns a confident, completely wrong answer.
The demo-to-production gap in AI agents is the widest I've seen in any technology cycle. And I've watched enough hype cycles to know that the gap is where the actual engineering happens.
What an Agent Actually Is
Strip away the marketing and an agent is a loop.
Observe the environment. Decide what to do. Take an action. Observe the result. Repeat until the task is done or you've decided it can't be done.
That's it. The "AI" part is that the decision step uses a language model instead of hand-coded rules. The "agent" part is that it operates in a loop with some degree of autonomy — it chooses its own next action rather than following a fixed script.
```python
# The entire agent industry, in five lines
while not done:
    observation = observe(environment)
    action = model.decide(observation, history, tools)
    result = execute(action)
    history.append((observation, action, result))
    done = is_complete(result) or budget_exceeded()
```

The simplicity is deceptive. That five-line loop hides every hard problem in production AI: reliability, cost control, error recovery, evaluation, and the fundamental tension between autonomy and safety.
The difference between a demo agent and a production agent has nothing to do with the model. It's entirely about what happens when the loop goes wrong — and it will go wrong, every single day, in ways you didn't anticipate.
What a Real Failure Looks Like
Last year, a team I know built an incident triage agent. The pitch was clean: when an outage hits, the agent correlates alerts across services, identifies the probable root cause, and drafts an initial response — cutting mean-time-to-resolution in half. The demo was beautiful. Twenty alerts go in, a clear root cause analysis comes out, the audience nods. Sold.
In production, during a real outage, the agent ingested 230+ alerts in the first four minutes. Latency spike on Service A, error rate jump on Service B, a fresh deploy on Service C — and the agent confidently pointed the on-call team at Service C's deployment as the root cause. Rolled it back. Waited. Nothing improved.
Forty minutes later, a senior engineer found the actual issue: connection pool exhaustion on a database that three services shared. The telemetry was there — buried in the alert stream. But the agent's context window had filled up processing the first 180 alerts, and the database metrics that would have cracked the case were in alerts 195 through 210. They got truncated. The agent never saw them.
The fix had three parts. First, context management — the agent now summarizes and prioritizes instead of ingesting every alert raw, so critical signals don't fall off the end of the window. Second, confidence signals — the agent reports how many alerts it couldn't process and flags when its coverage is incomplete. Third, and this is the one that matters most: a fallback mode. When confidence drops below a threshold, the agent stops trying to diagnose and instead surfaces the raw top-ten alerts sorted by severity, with a clear message: "I don't have enough signal. Here's what I'd look at first."
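That fallback pattern is simple enough to sketch. The sketch below is illustrative, not the team's actual code: the `Alert` class, `CONFIDENCE_FLOOR` value, and `diagnose` callback are all hypothetical names standing in for whatever the real system uses.

```python
from dataclasses import dataclass

# Hypothetical threshold; in practice this would be tuned against an eval suite.
CONFIDENCE_FLOOR = 0.6

@dataclass
class Alert:
    service: str
    severity: int  # higher = more severe
    message: str

def triage(alerts, diagnose):
    """Diagnose if confident; otherwise surface raw alerts sorted by severity."""
    diagnosis, confidence = diagnose(alerts)
    if confidence >= CONFIDENCE_FLOOR:
        return {"mode": "diagnosis", "result": diagnosis, "confidence": confidence}
    # Fallback: stop diagnosing, hand the on-call the raw top-ten signals.
    top_ten = sorted(alerts, key=lambda a: a.severity, reverse=True)[:10]
    return {
        "mode": "fallback",
        "message": "I don't have enough signal. Here's what I'd look at first.",
        "alerts": top_ten,
    }
```

The key design choice is that the fallback is deterministic code: the model never gets to argue its way past the threshold.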
That third fix saved the project. Because the failure wasn't that the agent got it wrong — agents will always get things wrong sometimes. The failure was that it got it wrong with absolute confidence and no escape hatch.
Engineers trust scar tissue more than principles. So here's the scar: the agent that confidently sent a team down a forty-minute dead end during a live outage taught us more about production readiness than six months of development did.
The Five Questions
Every production agent needs to answer five questions before it earns the right to run unsupervised. I've started using these in architecture reviews, and they've killed more bad ideas — early, cheaply — than any technical framework I've tried.
┌─────────────────────────────────────────────┐
│ PRODUCTION AGENT CHECKLIST │
│ │
│ 1. SCOPE → How narrow is the task? │
│ 2. CONTROL → Who approves consequences? │
│ 3. EVAL → How do you know it works? │
│ 4. RECOVERY → What happens when it fails? │
│ 5. ECONOMICS → What does it cost per task? │
│ │
│ If any answer is "we'll figure it out │
│ later" — it's not production-ready. │
└─────────────────────────────────────────────┘
These five axes — Scope, Control, Evaluation, Recovery, Economics — are the skeleton of every section that follows. Each one is a dimension where production agents either hold up or fall apart.
The Patterns That Actually Ship
After watching teams build, fail, rebuild, and occasionally succeed with agents in production, I've seen clear patterns emerge. The agents that survive contact with real users share a few traits.
Narrow scope, deep competence. The agents that work in production do one thing well. A coding agent that writes and runs tests. A support agent that handles password resets and billing questions. A data agent that pulls reports from specific sources in specific formats. The moment you try to build a general-purpose agent that "handles anything," you've signed up for a reliability problem that no amount of prompt engineering will solve. Narrow scope is a shipping strategy. General-purpose is a research aspiration.
Human-in-the-loop by default. Every production agent I've seen work well has a human checkpoint for consequential actions. The agent drafts; the human approves. The agent suggests; the human confirms. The agent executes low-risk steps autonomously and escalates anything with real consequences. Teams that skip this step learn why it matters the first time an agent sends an email to a customer with hallucinated pricing, or deletes records it was supposed to archive.
Deterministic guardrails around non-deterministic systems. The model is stochastic. Everything around it shouldn't be. Input validation, output parsing, tool permissions, spending limits, retry budgets, timeout policies — all of these should be hard-coded, tested, and enforced outside the model's control. The agent doesn't get to decide its own budget. The agent doesn't get to call tools it hasn't been explicitly granted. The agent doesn't get to retry forever. These constraints are what make the system trustworthy — the deterministic scaffolding that lets you sleep while the non-deterministic engine runs.
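Here is a minimal sketch of what that scaffolding can look like. The class and exception names (`ToolPolicy`, `GuardrailViolation`) are assumptions for illustration, not any particular framework's API; the point is that every check is plain, testable code the model cannot override.

```python
class GuardrailViolation(Exception):
    """Raised by the scaffolding, outside the model's control."""

class ToolPolicy:
    """Deterministic constraints checked before every tool call."""

    def __init__(self, allowed_tools, max_spend_usd, max_retries):
        self.allowed_tools = set(allowed_tools)
        self.max_spend_usd = max_spend_usd
        self.max_retries = max_retries
        self.spent = 0.0
        self.retries = {}  # failures per tool

    def check(self, tool_name, estimated_cost):
        if tool_name not in self.allowed_tools:
            raise GuardrailViolation(f"tool not granted: {tool_name}")
        if self.retries.get(tool_name, 0) >= self.max_retries:
            raise GuardrailViolation(f"retry budget exhausted: {tool_name}")
        if self.spent + estimated_cost > self.max_spend_usd:
            raise GuardrailViolation("spending limit exceeded")

    def record(self, tool_name, cost, failed):
        self.spent += cost
        if failed:
            self.retries[tool_name] = self.retries.get(tool_name, 0) + 1
```

The agent loop calls `check` before executing and `record` after; a `GuardrailViolation` ends the attempt regardless of what the model wants next.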
Tool design matters more than model choice. Teams agonize over which model to use and barely think about tool design. In practice, the quality of your tools — how well-scoped they are, how clear their descriptions are, how predictable their outputs are — has more impact on agent reliability than moving from one frontier model to another. A well-designed tool with a clear contract and good error messages makes the agent's job easier. A poorly designed tool with ambiguous parameters and inconsistent return types makes the agent guess, and guessing is where things break.
```python
# Bad: Ambiguous, overloaded, hard for the agent to use correctly
def manage_user(action, user_id=None, data=None, options=None):
    """Manage user accounts. Action can be create, update, delete,
    suspend, reactivate, merge, or export."""
    ...

# Good: Clear scope, obvious parameters, predictable behavior
def suspend_user(user_id: str, reason: str) -> SuspensionResult:
    """Suspend a user account. Returns the suspension details
    including the reactivation deadline."""
    ...
```

Design your tools the way you'd design an API for a junior developer in their first week. Explicit names. Required parameters. No overloaded methods. Clear error messages. The model is the junior developer — help it succeed.
Evaluation: The Part Everyone Skips
If you can't measure it, you're not running it. You're hoping.
Most agent projects don't have evaluation. They have vibes. Someone ran ten examples, eyeballed the results, said "looks good," and shipped it. When it breaks in production, they add a guardrail for that specific failure and ship again. The core problem isn't a bad model — it's the absence of a systematic way to know whether the agent is working.
Evaluation for agents is harder than evaluation for traditional ML. A classifier has accuracy. A recommendation system has click-through rate. An agent has… what? It took the right sequence of actions? It arrived at the correct answer? It stayed within budget? It recovered gracefully from a tool failure? All of the above, and the relative importance of each dimension changes depending on the task.
The teams that ship reliable agents build evaluation into the development loop from day one.
Trajectory evaluation. Don't just check if the agent got the right answer — check if it took a reasonable path to get there. An agent that stumbles into the correct result through five unnecessary tool calls and a hallucinated intermediate step is fragile. It got lucky. Next time, it won't.
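A trajectory check can be as simple as scoring the path alongside the answer. This is a sketch under assumed conventions: the trace shape (`answer` plus a list of `steps` with a `tool` field) is hypothetical, and a real suite would add more dimensions.

```python
def evaluate_trajectory(trace, expected_answer, allowed_tools, max_steps):
    """Score one agent run on both its answer and the path it took."""
    tool_calls = [step["tool"] for step in trace["steps"]]
    checks = {
        "correct_answer": trace["answer"] == expected_answer,
        "only_granted_tools": all(t in allowed_tools for t in tool_calls),
        "within_step_budget": len(tool_calls) <= max_steps,
    }
    checks["passed"] = all(checks.values())  # a lucky answer still fails
    return checks
```

An agent that produces the right answer through ungranted tools or far too many steps fails the run, which is exactly the "it got lucky" case this check exists to catch.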
Regression suites. Maintain a growing set of test cases — real tasks the agent should handle, with expected outcomes. Run them on every change. When the agent fails in production, add that case to the suite. This is the agent equivalent of a test suite, and it's exactly as non-negotiable.
Cost tracking per task. Every agent invocation has a cost — tokens, API calls, time. Track it. Set budgets. Alert on outliers. An agent that solves a $5 problem by spending $50 in API calls isn't working, even if the answer is correct.
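Outlier detection here doesn't need to be sophisticated. A minimal sketch, with made-up names and a crude median-based rule standing in for whatever alerting your metrics stack provides:

```python
import statistics

class CostTracker:
    """Per-task cost ledger with simple outlier flagging."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.history = []  # cost of each completed task

    def finish_task(self, cost_usd):
        self.history.append(cost_usd)

    def is_outlier(self, cost_usd, factor=3.0):
        """Flag tasks costing more than `factor` x the median so far."""
        if len(self.history) < 5:
            return cost_usd > self.budget_usd  # not enough data: use the budget
        return cost_usd > factor * statistics.median(self.history)
```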
The most dangerous failure mode is an agent that gives wrong answers indistinguishably from correct ones. Without evaluation, you won't know which is which until a user tells you — and by then, trust is already damaged.
Error Recovery: What Happens When Things Break
In a demo, nothing breaks. In production, everything breaks, constantly, in creative combinations.
The API returns a 429. The tool times out. The model outputs malformed JSON. The context window fills up mid-task. The upstream service changes its response format without warning. A user sends input in a language your prompts don't handle. The model confidently calls a tool with the wrong parameters and the tool throws an exception the model has never seen before.
Production agents need error recovery strategies that don't depend on the model figuring it out.
Structured retries with backoff. When a tool call fails, retry with exponential backoff — but cap the retries. Three attempts for transient errors, zero retries for permission errors or malformed requests. The retry logic lives outside the agent loop, in deterministic code the model can't override.
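A sketch of that policy, assuming HTTP-style status codes on tool errors (the `ToolError` class is illustrative):

```python
import time

TRANSIENT = {429, 500, 502, 503, 504}  # worth retrying with backoff
PERMANENT = {400, 401, 403, 404}       # fail fast, never retry

class ToolError(Exception):
    def __init__(self, status):
        super().__init__(f"tool failed with status {status}")
        self.status = status

def call_with_retries(tool, *args, max_attempts=3, base_delay=1.0):
    """Deterministic retry policy, living outside the agent loop."""
    for attempt in range(max_attempts):
        try:
            return tool(*args)
        except ToolError as e:
            if e.status in PERMANENT or attempt == max_attempts - 1:
                raise  # permission/malformed errors surface immediately
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```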
Graceful degradation. When the agent can't complete a task, it should say so clearly — with context about what it tried and where it got stuck — rather than fabricating an answer. This requires explicit instruction in the system prompt and, more importantly, evaluation that tests for it. If your eval suite doesn't include "tasks the agent should refuse or escalate," you're only testing the happy path.
State checkpointing. For multi-step tasks, save intermediate state so you can resume after a failure instead of starting over. This is especially important for expensive operations — if an agent has completed seven of ten steps and the eighth fails, restarting from scratch wastes the work already done and doubles the cost.
The quality of an agent isn't measured by how well it performs when everything works. It's measured by how gracefully it handles the moment something doesn't.
The Cost Problem
The most expensive line item in your roadmap might be the one that looks like magic.
A single frontier model API call costs pennies. An agent that chains fifteen calls with long contexts to handle one user request costs dollars. Multiply that by thousands of requests per day, add in the retries and the failed attempts, and suddenly your AI agent feature has a unit economics problem that your CFO will notice.
The math gets worse with complexity. A simple agent that answers questions from a knowledge base might cost $0.02 per request. A complex agent that researches, plans, executes multi-step workflows, and validates its own output might cost $2-5 per invocation. At scale, that's the difference between a rounding error and a line item.
Teams that ship agents sustainably do a few things.
Model routing. Use smaller, cheaper models for simple decisions and reserve the expensive models for complex reasoning steps. Not every action in the loop needs frontier intelligence. Classifying a user's intent? A small model handles that. Planning a multi-step workflow? That's where you bring in the heavy artillery.
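In its simplest form, routing is just a lookup table keyed on step type. Model names and prices below are placeholders, not real pricing:

```python
# Illustrative routing table; model identifiers and costs are examples only.
ROUTES = {
    "classify_intent": {"model": "small-model",    "usd_per_call": 0.001},
    "extract_fields":  {"model": "small-model",    "usd_per_call": 0.001},
    "plan_workflow":   {"model": "frontier-model", "usd_per_call": 0.05},
    "final_answer":    {"model": "frontier-model", "usd_per_call": 0.05},
}

def route(step_type):
    """Pick the cheapest adequate model; default to the strong one."""
    return ROUTES.get(step_type, ROUTES["final_answer"])
```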
Caching aggressively. If the same tool call with the same parameters returns the same result, cache it. If similar queries hit the same reasoning path, cache the plan. Semantic caching — matching queries by meaning rather than exact string — can cut costs by 30-40% for agents with repetitive workloads.
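The exact-match half of this is a few lines: key the cache on the tool name plus canonicalized parameters. The class name is hypothetical; semantic caching would replace the exact key with an embedding-similarity lookup, which this sketch deliberately leaves out.

```python
import json

class ToolCallCache:
    """Exact-match cache for deterministic tool calls."""

    def __init__(self):
        self._cache = {}
        self.hits = 0

    def call(self, tool_name, params, tool_fn):
        # sort_keys makes {"a":1,"b":2} and {"b":2,"a":1} the same key
        key = (tool_name, json.dumps(params, sort_keys=True))
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = tool_fn(**params)
        self._cache[key] = result
        return result
```

This only makes sense for tools whose results are stable over the cache's lifetime; anything time-sensitive needs a TTL on top.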
Setting hard budgets. Every agent invocation gets a token budget and a time budget. When either is exceeded, the agent stops, returns what it has, and explains what it couldn't finish. This prevents runaway costs and — just as importantly — prevents the agent from spending ten minutes on a task the user expected to take ten seconds.
Security Is Not a Prompt
Guardrails get a lot of airtime. Governance almost none. And in regulated production environments, governance is where agent deployments actually die.
The question isn't just "can the agent do something harmful?" — it's "who is allowed to let the agent do what, with whose data, and is there a record?"
Permission scoping. The agent inherits someone's access. Whose? A support agent handling customer requests shouldn't query the same databases an internal analytics agent can. In practice, most teams give the agent a service account with broad access because it's easier to set up, and then discover six months later that the agent could see salary data, customer PII, or internal financial projections that no single user should access without controls. The principle of least privilege applies to agents exactly the way it applies to humans — arguably more, because the agent processes requests from many users and a single over-permissioned agent becomes a privilege escalation vector for every user who can talk to it.
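Deny-by-default scoping can start as a plain grants table checked in infrastructure code, not in the prompt. Agent identities and tool names below are invented for illustration:

```python
# Least-privilege grants per agent identity.
GRANTS = {
    "support-agent":   {"lookup_order", "reset_password"},
    "analytics-agent": {"run_report", "query_warehouse"},
}

def authorize(agent_id, tool_name):
    """Deny by default: a tool not explicitly granted is not callable."""
    return tool_name in GRANTS.get(agent_id, set())
```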
Data boundary enforcement. In multi-tenant systems, the agent must respect tenant isolation. If Customer A's support agent can surface data from Customer B's account because the underlying query didn't include a tenant filter, you don't have an AI problem — you have a data breach. And the agent won't catch it, because from the model's perspective, it successfully answered the question.
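The safe shape is a data-access layer that injects the tenant filter itself, so there is no filterless query for the agent to construct. A toy in-memory version of the idea (the class name is illustrative; in practice this lives in your ORM or query layer):

```python
class TenantScopedStore:
    """All reads go through a wrapper that injects the tenant filter,
    so the model physically cannot query across tenants."""

    def __init__(self, rows):
        self._rows = rows  # every row carries a tenant_id

    def query(self, tenant_id, predicate=lambda r: True):
        return [r for r in self._rows
                if r["tenant_id"] == tenant_id and predicate(r)]
```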
Audit trails. Every consequential action the agent takes needs a log entry: who triggered it, what the agent decided, which tools it called, what data it accessed, and what the outcome was. Not for debugging — for compliance. When the auditor asks "why did this agent approve this refund?" or "who authorized this data export?", "the model decided" is not an acceptable answer. The trace becomes a legal document.
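The record itself is unglamorous: one structured, append-only entry per consequential action. Field names below are one reasonable layout, not a standard:

```python
import datetime
import json

def audit_entry(user_id, agent_id, action, tools_called, data_accessed, outcome):
    """One append-only record per consequential agent action."""
    return json.dumps({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "triggered_by": user_id,       # the human behind the request
        "agent": agent_id,             # which agent identity acted
        "action": action,
        "tools_called": tools_called,
        "data_accessed": data_accessed,
        "outcome": outcome,
    })
```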
The most common production failure I've seen in enterprise agent deployments has nothing to do with hallucination. It's an agent that could access data the requesting user shouldn't have seen. Security isn't a system prompt instruction — it's infrastructure that the model can't override.
Designing for Trust
Here's a production truth that engineering teams tend to overlook: user expectation management is as critical as model performance.
Production agents fail when users think they're deterministic. When users assume every answer is authoritative. When there's no signal for "I'm guessing" versus "I'm confident." The model doesn't know the difference either — which is exactly why the product layer has to handle it.
Show the work. The best production agents expose what they did — which tools they called, what data they accessed, what reasoning led to the answer. This sounds like a nice-to-have until the first time a user gets an unexpected result and your support team has to say "we don't know why the agent said that." Transparency converts a black box into a system users can calibrate against.
Signal uncertainty. When an agent's answer depends on incomplete data, the user should know. When the agent fell back to a weaker strategy because the preferred tool was unavailable, the user should know. Confidence isn't just a model output — it's a product design decision about when to caveat, when to hedge, and when to say "I couldn't verify this."
Design the escalation. Every agent needs a clear path to a human. Not buried in a menu — visible, obvious, and fast. The moment a user feels trapped in a conversation with an agent that can't help them, trust collapses for the agent and for the product. The escalation path is the safety net that makes users willing to try the agent in the first place.
The best agent UX patterns don't hide the agent's limitations. They make limitations legible. Users who understand what the agent can and can't do will use it more — and complain less — than users who were promised magic and received a chatbot.
What Doesn't Ship
Some agent patterns look great on paper and collapse in production. Recognizing them early saves months.
Autonomous multi-agent systems. The idea: multiple specialized agents collaborating, delegating tasks to each other, negotiating solutions. The reality: cascading failures, impossible debugging, and emergent behaviors that nobody predicted or wanted. When Agent A misunderstands Agent B's output and passes garbage to Agent C, you're debugging a distributed system where none of the nodes are deterministic. Until you can debug one agent cleanly, don't build three.
Agents without guardrails. "Let the model figure it out" is a philosophy that works in demos and fails in production. Every production agent needs explicit boundaries — which tools it can call, what data it can access, what actions require human approval, how much it can spend. The model is the engine. The guardrails are the steering wheel, the brakes, and the seatbelts. You wouldn't ship a car with just an engine.
Agents without fallbacks. This one gets missed constantly. If the agent goes down — model API outage, context window overflow, budget exhausted — what happens to the feature? If the answer is "the feature is gone," you've built a single point of failure around the least reliable component in your stack. Every production agent needs a deterministic fallback path: a simplified flow, a manual mode, a rule-based backup. The agent is the fast path, not the only path.
"Just add an agent" feature creep. The temptation is real — once you have an agent framework, everything looks like an agent problem. Search? Agent. Analytics? Agent. Onboarding? Agent. Most of these are better served by traditional software with good UX. Agents add value when the task requires genuine reasoning and adaptation — when the path isn't known in advance and the system needs to react to intermediate results. For everything else, a well-designed API and a good interface will outperform an agent in reliability, cost, and user experience.
Before building an agent, ask: does this task actually require autonomous decision-making? Or would a database query, a workflow engine, or a well-designed form handle it better, cheaper, and more reliably? The answer will save you months.
Observability: Watching the Loop
You can't run what you can't see.
Traditional application monitoring — latency, error rates, throughput — captures maybe 20% of what you need to understand an agent in production. The other 80% is agent-specific: which tools did it call and in what order? How many loop iterations did it take? Where did it spend the most tokens? What was the model's reasoning at each step? When did it deviate from the expected path?
The emerging standard is trace-based observability — recording every step of the agent loop as a trace with spans for each decision, tool call, and model invocation. Tools like LangSmith, Braintrust, Arize Phoenix, and Humanloop are building around this pattern. The trace becomes your debugging tool, your evaluation input, and your cost accounting ledger all at once.
But traces without operational discipline are just a fancy dashboard nobody checks.
The teams running agents well treat traces the way backend teams treat logs — they're infrastructure, not decoration. When an agent misbehaves in production, the first thing you look at is the trace. Why did it call that tool? What was in the context at that point? Where did the reasoning go sideways?
The part most teams skip: someone has to own this operationally. Agents need on-call rotations like any other production service. They need alert thresholds — on cost spikes, on failure rates, on confidence drops, on anomalous tool call patterns. They need runbooks: "when the agent's error rate exceeds 5%, check the upstream API status first, then review the last 20 traces for tool failures, then check whether the model provider degraded." Production means someone's phone rings. If nobody's phone rings when the agent breaks, you're not running a production system — you're running a demo that happens to have users.
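The threshold check behind those alerts can start very simply. The metric names and limits below are examples, not recommendations; real values come from your own baselines:

```python
# Illustrative alert thresholds for an agent service, per monitoring window.
THRESHOLDS = {
    "error_rate": 0.05,         # fraction of runs that failed
    "cost_per_task_usd": 1.00,  # average spend spike
    "fallback_rate": 0.20,      # how often the agent gave up or escalated
}

def check_alerts(metrics):
    """Return the names of thresholds breached in the last window."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]
```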
Where This Goes
We're in the early innings. The agents shipping today are narrow, heavily supervised, and expensive. They're also genuinely useful in ways that weren't possible two years ago.
The trajectory is clear. Models get cheaper and more reliable. Tool ecosystems mature. Evaluation frameworks standardize. Error recovery patterns become well-understood. The narrow agents of today become the reliable building blocks of more capable systems tomorrow.
But the path from here to there runs through engineering, not magic. Every hard-won lesson about reliability, evaluation, cost control, and error recovery applies. The teams that treat agents as software systems — with all the rigor that implies — will build things that last. The teams that treat agents as demos that somehow ended up in production will learn the same lessons the hard way.
The future of AI agents in production looks less like science fiction and more like good, boring software engineering — applied to systems that happen to have a language model in the loop.
An agent is a component, not a product. Like every component in a production system, it needs tests, monitoring, error handling, cost controls, fallback paths, and someone who understands it well enough to fix it when things go sideways.
If you wouldn't trust it to run unattended at 3 AM, it's not ready for production.