← back to writing

System-design notes from shipping hosted LLM workloads

Prompt caching, tool-use loops, structured output, and the failure modes that show up in regulated deployments

Hosted large language models have been the primary inference target for enough of my recent work that the experience has shaped how I design systems around them. Not at the prompt-engineering level (“write better prompts”) but at the system-design level: where caching goes, what the tool-use loop looks like, which failure modes you actually have to design around, and what changes when the model sits at the center of a regulated deployment versus a consumer chat interface.

Prompt caching is a load-bearing architectural decision, not an optimization

The first mental model I had for prompt caching was “free latency and cost wins.” That framing undersells it. Prompt caching is the mechanism that lets you treat a long, slow-to- change context (system prompt, tool definitions, a large knowledge-base snippet, an entire document the user is asking about) as if it were a pre-warmed runtime state. Once you internalize that framing, a lot of system-design choices fall out:

  • Cache boundaries are API-shape decisions. Where you put the cache breakpoints determines which parts of the context are cheap to reuse across turns and which are not. That pushes back onto how you structure system prompts (stable first, variable last), how you pass documents (once, in a cached block, instead of chunked across turns), and how you define tools (stably, even if individual calls vary).
  • Cache breakpoints determine your multi-tenant story. Each tenant’s system prompt and document context should land in distinct cache slots so there is no cross-tenant cache reuse. The easiest way to achieve that is by putting a tenant ID in the cacheable block. “Get multi-tenancy for free by just changing one byte of the prompt per tenant” is a perverse but correct outcome.
  • Cache economics change which prompts are rational. A 5,000-token system prompt used to be an extravagance. With caching, it is a default. The advice “prompt length is expensive, keep it short” is outdated for any workload where the same prompt runs more than a few times per user.

The single biggest engineering lever in most hosted-LLM deployments is reorganizing the prompt so the stable parts come first and the cacheable blocks are maximized. I’ve seen that one change cut per-call cost by more than half on real workloads. Teams treating caching as “something we’ll look at later” are leaving load-bearing performance on the floor.

Tool use is a state machine you have to actually design

Tool use with a hosted model reads, in the SDK examples, like a function call. That is a useful abstraction and it is also misleading at the system-design level, because tool use is actually a loop: the model proposes a tool call, the application runs the tool, the result goes back to the model, and the model either proposes another tool call or produces a final answer. That loop is a state machine, and treating it as one is the difference between a prototype and a production agent.

The four things I design explicitly into the loop:

  1. A maximum-step budget, surfaced to the caller. Agents that don’t have a step budget will loop, and the loop will be invisible to the user until they get a very large bill. The budget should be tight enough that a normal task finishes inside it and loose enough that a legitimately complex task can do the work.
  2. A “the model asked for a tool that doesn’t exist” branch. This happens on every model I’ve used, hosted or open-weight. The loop needs a dedicated error path that feeds “that tool does not exist in this session; here are the tools that do” back to the model, rather than crashing the loop with a raw exception.
  3. Observability at the loop level, not the call level. Log every step of the loop with the model input, the tool choice, the tool input, the tool output, and the elapsed time. When a production agent behaves weirdly, the thing you want is a replayable trace of the loop, not a single model call.
  4. A “the model is guessing” detector. If two consecutive model turns propose the same tool call with the same arguments, the model is probably stuck. The loop should break, summarize what it tried, and ask the user. Make this an invariant rather than a recommendation.

The reason I describe this as a state machine is that any tool-use loop without these four controls will eventually produce one of these failure modes in production. They are always bugs waiting to happen, and they always present at the worst time.

Structured output is where the deployment story gets easier or harder

Structured output (JSON mode, schema-constrained generation, tool-call-as-output) is the single feature that determines whether a downstream system can consume model output deterministically. The architectural question it forces is: is the schema the product’s schema, or an adapter schema?

The bad pattern I see repeatedly: a team asks the model to produce “structured output matching our internal database schema directly,” the output is mostly correct, the team writes a validator, the validator fails a few percent of the time, and the team piles on retry logic. The retry-and-hope loop ends up more complex than the original integration.

The pattern I reach for instead: the model produces a narrow, deliberately model-friendly schema, and a plain deterministic adapter function translates it into the database’s schema. The model’s schema is shaped for model success (flat, few enums, no ambiguous fields); the adapter absorbs the impedance mismatch. This sounds like indirection and it is; the indirection is earning its keep because the places the model’s schema and the database’s schema disagree are exactly the places the retry loop was failing.

The rule of thumb: never make the model responsible for upholding invariants that a deterministic function could uphold downstream. The adapter is the invariant enforcer. The model produces plausible structure. Each component does what it is good at.

The four failure modes I design around in regulated deployments

Shipping a hosted LLM into a regulated enterprise environment, as opposed to a consumer product, introduces four specific failure modes that don’t show up in most tutorials. Each one is solvable; each one has to be designed for explicitly.

The “helpful guess” failure. A strong model that doesn’t have the evidence will produce a confident answer anyway unless the prompt explicitly licenses refusal. This is the single most dangerous failure mode for document-QA in regulated work. The mitigation is a prompt that puts refusal in the model’s license explicitly and an eval that grades refusal as a first-class score.

The “drift across a long context” failure. When the context window is full, the attention behavior on very early and very late tokens is less uniform than the docs suggest. The practical consequence is that critical instructions placed in the middle of a long context sometimes don’t stick. The mitigation is structural: critical instructions go at the end (just before the user query) and at the start (in a cacheable system prompt). Redundancy beats clever placement.

The “reasoning that doesn’t cite itself” failure. A model that reasons well doesn’t automatically ground its reasoning in the retrieved evidence. You have to require that every claim in the answer is cited, and you have to verify that every citation exists in the retrieval set. A generator that skips the verification step is a generator that will produce citation-drift answers in production.

The “tool output is adversarial” failure. In an enterprise deployment, tool outputs are often user-provided documents. Those documents can contain strings that look like instructions: “Ignore previous instructions and…”, embedded HTML comments with role markers, adversarial prompt snippets. The mitigation is to treat every tool output as untrusted data by convention, and in particular not to concatenate tool output into the system-prompt region of the context. Tool output goes in user-role turns with explicit framing. This is the single most overlooked hardening step I see in teams moving from prototype to production.

Why I write these down

These are the things I wish someone had written down for me when I started shipping production hosted-LLM workloads. They are the kind of thing you learn in the first six months by hitting them and then wishing someone had warned you.

The common thread across all four items is that they are system-design concerns, not prompt-engineering ones. Better prompts help and are necessary. They are not what determines whether a production deployment is safe, auditable, and economically rational. The architecture is.