Most teams do not discover LLM inference costs during the proof of concept.
They discover them after usage scales.
By then, prompts are longer, retrieval pipelines are heavier, retries are more frequent, and multiple product features may already depend on the same inference layer.
On AWS, this becomes more than a billing issue. It becomes a platform architecture problem.
Amazon Bedrock pricing depends on the model provider, model type, modality and inference mode, with options such as on-demand, batch and provisioned throughput depending on the model and use case. AWS also notes that batch inference is available for selected models at lower pricing than on-demand inference.
For self-hosted LLMs, the cost profile is different. The model may be open-weight, but the infrastructure is not free. GPU utilisation, autoscaling behaviour, model serving efficiency and operational overhead become the main cost drivers.
The real question is not simply:
Which model is cheaper?
The better question is:
Is our inference architecture designed to control cost as usage grows?
LLM cost does not scale like a normal API cost.
Request volume matters, but it is only one part of the picture. Production costs grow across tokens, retrieval, concurrency, retries, routing and infrastructure capacity.
| Cost Driver | Why It Matters in Production |
|---|---|
| Input tokens | Long prompts, conversation history and retrieved context increase every request cost. |
| Output tokens | Long generated answers can quietly increase spend. |
| Retrieval | Vector search, reranking and context assembly add hidden cost before the model is called. |
| Retries | Failed or slow requests may multiply token usage. |
| Concurrency | More simultaneous users can require higher throughput or provisioned capacity. |
| Model routing | Using the same powerful model for every task wastes budget. |
| Self-hosted capacity | GPU-backed infrastructure becomes expensive when utilisation is low. |
The problem is not that LLMs are expensive by default.
The problem is that production usage often grows before the inference path is designed.
A proof of concept may involve one prompt, one model and one user journey. A production system may involve query rewriting, retrieval, ranking, prompt assembly, guardrails, retries, fallback logic, logging and tenant-level reporting.
That is why LLM cost optimisation has to start at the architecture level.
Instead of looking only at model pricing, engineering teams should look at the full inference path.
| Area | Common Cost Problem | Practical Fix |
|---|---|---|
| Context window | Too much history or irrelevant context is sent to the model. | Summarise sessions, trim history and separate static from dynamic prompt content. |
| RAG pipeline | Too many chunks are retrieved and passed into the prompt. | Use metadata filtering, reranking, deduplication and stricter top-k limits. |
| Output length | The model generates longer answers than the task requires. | Define output budgets by workload type. |
| Model choice | Simple tasks use expensive reasoning models. | Route classification, extraction and tagging to cheaper paths. |
| Embeddings | Content is re-embedded too often. | Re-embed only when documents change or when index refresh is required. |
| Retries | Failed calls are retried without cost controls. | Add retry budgets and route-aware fallback logic. |
| GPU capacity | Self-hosted models run on underused infrastructure. | Track utilisation and separate real-time from batch workloads. |
This is where many teams underestimate cost.
They look at the model price, but the real spend is created by repeated context, weak retrieval, excessive output, poor routing and invisible retry behaviour.
A smaller model may help, but only if the request path is already under control.
Amazon Bedrock and self-hosted LLMs do not only differ by pricing. They create different cost risks.
| Area | Amazon Bedrock | Self-Hosted LLMs on AWS / Kubernetes |
|---|---|---|
| Main cost risk | Token growth and model usage | GPU utilisation and serving efficiency |
| Infrastructure ownership | Lower | Higher |
| Scaling concern | Throughput, quotas, token volume | Node capacity, autoscaling, batching, cold starts |
| Best fit | Variable workloads, faster platform adoption, managed model access | Stable high-volume workloads, stronger runtime control |
| Hidden risk | Untracked prompts, retries, RAG overhead | Idle GPUs, operational complexity, poor utilisation |
| Cost control focus | Token visibility, routing, caching, batch inference | GPU scheduling, batching, autoscaling, workload isolation |
Bedrock can reduce infrastructure ownership, but it does not remove the need for token governance.
Self-hosting can reduce unit cost at scale, but only when utilisation is high enough to justify the operational burden.
For Bedrock workloads, AWS provides CloudWatch metrics such as invocations, latency, throttles, input token count, output token count, time to first token and estimated tokens-per-minute quota usage. These metrics are important because cost, latency and quota pressure need to be visible together in production.
For self-hosted or SageMaker-based inference, capacity management becomes more important. AWS introduced scale-to-zero support for SageMaker inference endpoints using inference components, which can reduce cost during periods of low or no usage, but teams still need to consider cold starts, failed requests during scale-out and application latency tolerance.
Cost optimisation should not start after the bill arrives. It should be designed into the inference path.
| Pattern | What It Does | Cost Impact |
|---|---|---|
| AI gateway | Centralises model access, routing, budgets and policy. | Prevents scattered, uncontrolled model calls. |
| Workload-based routing | Sends simple tasks to cheaper paths and complex tasks to stronger models. | Reduces unnecessary use of expensive inference. |
| Prompt caching | Reuses repeated static context where supported. | Reduces repeated input token cost and latency. |
| Selective RAG | Runs retrieval only when needed and sends fewer, better chunks. | Cuts retrieval overhead and input token volume. |
| Batch inference | Moves non-urgent workloads out of the real-time path. | Improves throughput and avoids unnecessary real-time cost. |
| Token budgets | Sets limits by tenant, workflow, route or environment. | Prevents one feature or tenant from consuming uncontrolled spend. |
| Retry controls | Limits repeated failed calls and fallback overuse. | Stops reliability logic from multiplying cost. |
Amazon Bedrock prompt caching is designed for workloads with long, repeated context, such as document-based chat or repeated system instructions. AWS states that prompt caching can reduce inference response latency and input token costs for supported models, but cache effectiveness depends on keeping the reusable prompt prefix stable between requests.
Amazon Bedrock batch inference is another useful pattern for non-interactive workloads. It lets teams submit multiple prompts asynchronously and write outputs to Amazon S3, which is useful for large-scale processing where real-time response is not required.
A production AI platform should not let every application decide which model to call.
A better pattern is to centralise routing logic through an AI gateway or inference service.
Request
→ Identify tenant
→ Classify task
→ Check latency requirement
→ Check data sensitivity
→ Check token budget
→ Select model route
→ Log cost per outcome
This allows the platform to route different workloads differently.
| Workload | Better Route |
|---|---|
| Email classification | Structured output route |
| Support ticket tagging | Lower-cost classification route |
| Product Q&A | RAG route with selective retrieval |
| Long report generation | Async batch route |
| Complex financial analysis | Stronger model with strict token budget |
| Sensitive workflow | Restricted route with stronger audit controls |
The goal is not to use the cheapest model everywhere.
The goal is to avoid using expensive inference where it does not create enough value.
RAG can improve answer quality, but it can also increase cost if retrieval is triggered for every request.
A cost-aware RAG path should decide whether retrieval is needed before assembling context.
User query
→ Does this need retrieval?
→ Select data source
→ Apply metadata filter
→ Retrieve top relevant chunks
→ Rerank / deduplicate
→ Send compact context to model
Many RAG systems become expensive because they pass too much context into the model.
A better approach is to reduce context before inference:
| RAG Problem | Cost-Aware Fix |
|---|---|
| Retrieval runs for every query | Classify whether retrieval is needed first. |
| Too many chunks are retrieved | Use stricter top-k limits by workflow. |
| Chunks are too broad | Improve chunking and metadata quality. |
| Context contains duplicates | Deduplicate before prompt assembly. |
| Retrieved context is weak | Use reranking before model invocation. |
| Full documents are passed into prompts | Send only the relevant sections. |
This connects directly with production RAG architecture. Bion’s RAG guidance already frames RAG as a distributed systems problem, where retrieval quality, latency, observability and cost need to be designed together rather than treated as a simple model call.
Practical Ways to Reduce LLM Inference Cost
The biggest savings usually come from the request path, not from changing models first.
Start by removing unnecessary input and output tokens.
Practical changes include:
For example, a support assistant may not need the last 20 messages in full. It may only need the current question, the latest user intent and a short session summary.
That reduces input tokens without changing the user experience.
Do not run RAG for every request.
Some requests can be answered from:
Retrieval should be triggered when it adds value, not because it is part of a fixed pipeline.
Classification, tagging, routing and extraction should not use the same inference route as complex reasoning.
A high-volume classification workflow should usually produce a short structured response, not a long natural language answer.
Example:
{
"category": "billing_issue",
"priority": "high",
"confidence": 0.87,
"next_action": "route_to_finance"
}
This reduces output tokens, improves automation and makes validation easier.
Bion’s Amazon Bedrock email classification case study is a useful example of how Bedrock can be used in a practical workflow with AWS services around it, rather than as an isolated model call.
Not every LLM workload needs real-time inference.
Good candidates for batch processing include:
Batch processing gives teams more control over timing, throughput and cost. It also prevents background workloads from competing with user-facing requests.
Retries are useful for reliability, but dangerous for cost.
A failed request that is retried three times may cost four times more than expected. Fallback routes can create the same problem if they are not controlled.
Use retry rules such as:
Retry once for temporary failure
Do not retry if token budget is exceeded
Use cheaper fallback for low-priority tasks
Use stronger fallback only for business-critical workflows
Log retry cost separately
Reliability logic should not create uncontrolled spend.
Total token usage is useful, but it is not enough.
The more useful metric is cost per successful outcome.
Track cost per:
This helps engineering, product and finance teams see which AI features are economically sustainable in production.
Most LLM cost problems are not caused by one bad decision.
They come from small uncontrolled patterns that accumulate.
| Underestimated Area | Why It Matters |
|---|---|
| Observability | Without token, latency, retry and route visibility, teams cannot see where cost is created. |
| Retrieval cost | RAG cost includes embedding, indexing, vector search, reranking and context assembly, not only model inference. |
| Retry and fallback cost | Reliability logic can silently multiply spend if every failed request triggers another model call. |
| Tenant attribution | SaaS teams need to know which customer, feature or workflow creates the cost. |
| Idle capacity | Self-hosted inference can become expensive when GPU infrastructure is underused. |
AWS has continued to improve operational visibility for Bedrock inference workloads, including CloudWatch metrics for time to first token and estimated tokens-per-minute quota usage. These metrics help teams monitor perceived responsiveness and quota pressure, not only total token count.
For production teams, the key question is not only:
How many tokens did we use?
It is:
Which feature, tenant or workflow created that cost, and did it produce value?
Amazon Bedrock and self-hosted LLMs change different parts of the production operating model.
Bedrock reduces the burden of model infrastructure and fits naturally into AWS governance patterns. Self-hosted LLMs give teams deeper control over execution, performance and deployment topology. We covered these trade-offs in more detail in our guide to Amazon Bedrock vs self-hosted LLMs in production.
But neither option removes the need for platform design.
The teams that succeed with GenAI in production will not be the ones that simply pick the strongest model. They will be the ones that can control how models are accessed, how workloads are routed, how sensitive data is handled, how cost is attributed and how failures are observed.
This is why GenAI observability needs to cover more than logs and latency. Production teams need visibility across tokens, retrieval behaviour, model routes, retries, fallbacks, tenant usage and cost per outcome.
Cost control should also sit alongside governance, security and operational controls when teams scale responsible GenAI on AWS.
In production, the model matters. The system around the model matters more.
If your AI product is moving beyond prototype stage, Bion can help you design the AWS platform, model routing, observability and governance layer needed for production. Our LLM deployment on AWS service helps teams build production-ready GenAI platforms with Amazon Bedrock, RAG pipelines, Kubernetes-based inference workloads and cost-aware architecture.
Book a technical strategy call to review your AWS LLM architecture and inference cost model.