
RAG Architecture on AWS: What Actually Works in Production

Written by Bion DevOps Team | Apr 24, 2026 1:17:22 PM

Most RAG systems don’t fail because of the model.

They fail because the surrounding system is not designed for production.

Early implementations usually follow a simple pattern:

  • documents embedded
  • stored in a vector database
  • retrieved and passed to an LLM

This works in controlled environments.

In production, the same setup starts to degrade:

  • retrieval quality drops
  • latency becomes unpredictable
  • responses lose consistency
  • debugging becomes difficult

RAG is not a model problem. It is a distributed systems problem.

Where Production RAG Systems Actually Break
Retrieval Quality Degrades Over Time

Initial indexing works well.

Over time:

  • new data is added inconsistently
  • embeddings are not refreshed
  • metadata becomes unreliable

Result: irrelevant context → degraded outputs.

Latency Is a System-Level Issue

Latency builds across:

  • vector search
  • orchestration
  • LLM inference

Even small delays accumulate.

Using tools like AWS X-Ray, teams can visualise how latency compounds across services — something basic metrics alone cannot explain.

No Visibility Into Retrieval → Output

Most systems cannot answer:

  • which documents were retrieved
  • why they were selected
  • how they affected the final answer

Without this, debugging hallucinations is guesswork.

Cost Scales Quietly

Costs grow across:

  • embeddings
  • storage
  • token usage

Without visibility, optimisation happens too late.

What Actually Works on AWS (In Practice)

1. Data Pipelines Are Treated as Systems

Production setups include:

  • versioned ingestion
  • consistent chunking
  • enforced metadata schemas

Typical pattern:

  • Amazon S3 as source of truth
  • AWS Lambda for processing
  • AWS Step Functions for orchestration

In production systems, ingestion and chunking strategies are explicitly defined and versioned, rather than handled ad hoc.
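As a sketch, the chunking stage of such a pipeline might look like the following. The chunk sizes, schema version tag, and metadata field names are illustrative assumptions, not a prescribed configuration:

```python
import hashlib

# Illustrative chunking parameters -- tune per corpus.
CHUNK_SIZE = 1000      # characters per chunk
CHUNK_OVERLAP = 200    # overlap between consecutive chunks
SCHEMA_VERSION = "v2"  # bump when chunking or metadata rules change

def chunk_document(text: str, source_key: str) -> list[dict]:
    """Split a document into overlapping chunks with an enforced metadata schema."""
    chunks = []
    step = CHUNK_SIZE - CHUNK_OVERLAP
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        body = text[start:start + CHUNK_SIZE]
        if not body:
            break
        chunks.append({
            "chunk_id": f"{source_key}#{i}",
            "text": body,
            # Enforced metadata: every chunk carries provenance and version,
            # so stale or inconsistently chunked data can be found and re-embedded.
            "metadata": {
                "source_key": source_key,
                "schema_version": SCHEMA_VERSION,
                "content_hash": hashlib.sha256(body.encode()).hexdigest(),
            },
        })
    return chunks
```

Because the schema version travels with every chunk, a change to the chunking strategy becomes a queryable fact rather than silent drift.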

2. Retrieval Is Layered

Working systems include:

  • hybrid search
  • metadata filtering
  • re-ranking

Common backend:

  • Amazon OpenSearch Service (vector search)
  • Amazon Aurora PostgreSQL with pgvector


Figure: Layered retrieval pipeline used in production RAG systems.

Retrieval is not static; it is continuously tuned based on query behaviour and evaluation metrics.

Reference: AWS OpenSearch vector search documentation 
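As an illustration, the layers above might be combined roughly as follows. The index fields, filter keys, and the toy re-ranking heuristic are assumptions for the sketch, not production values:

```python
# A minimal sketch of layered retrieval: an OpenSearch query body that
# combines BM25 keyword match, k-NN vector search, and a metadata filter,
# followed by a cheap re-ranking pass over the candidates.

def build_hybrid_query(query_text: str, query_vector: list[float],
                       doc_type: str, k: int = 20) -> dict:
    """Build a hybrid query: lexical + semantic scoring, hard metadata filter."""
    return {
        "size": k,
        "query": {
            "bool": {
                "filter": [{"term": {"metadata.doc_type": doc_type}}],
                "should": [
                    {"match": {"text": {"query": query_text}}},      # lexical
                    {"knn": {"embedding": {"vector": query_vector,   # semantic
                                           "k": k}}},
                ],
            }
        },
    }

def rerank(candidates: list[dict], query_terms: set[str], top_n: int = 5) -> list[dict]:
    """Toy re-ranker: boost candidates with more query-term overlap."""
    def overlap(c: dict) -> int:
        return len(query_terms & set(c["text"].lower().split()))
    return sorted(candidates, key=overlap, reverse=True)[:top_n]
```

In a real system the re-ranking step would typically be a cross-encoder or a managed re-ranking model rather than term overlap, but the shape is the same: retrieve broadly, then narrow with a second scoring pass.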

3. LLM Layer Is Abstracted

In production systems, the LLM layer is not a single model call.

Instead of locking into one model:

  • model switching is possible
  • prompts are versioned
  • fallback logic exists

Often implemented via Amazon Bedrock.

Figure: LLM routing and abstraction layer with model selection and fallback logic.

The LLM layer is treated as a replaceable component, not a fixed dependency.

Reference: Amazon Bedrock monitoring documentation
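A minimal sketch of this abstraction, assuming Bedrock's Converse API; the model IDs and the prompt-version tag are illustrative:

```python
# Model routing with fallback over the Bedrock Converse API.
# The preference list and prompt version id are illustrative assumptions.

PROMPT_VERSION = "answer-with-context/v3"  # versioned prompt template id

MODEL_PREFERENCE = [
    "anthropic.claude-3-5-sonnet-20240620-v1:0",  # primary
    "anthropic.claude-3-haiku-20240307-v1:0",     # fallback
]

def invoke_with_fallback(client, prompt: str) -> dict:
    """Try each model in preference order; return answer plus routing metadata."""
    last_error = None
    for model_id in MODEL_PREFERENCE:
        try:
            response = client.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
            )
            return {
                "text": response["output"]["message"]["content"][0]["text"],
                "model_id": model_id,            # which model actually answered
                "prompt_version": PROMPT_VERSION,
            }
        except Exception as exc:  # e.g. throttling or model unavailability
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```

In practice `client` would be `boto3.client("bedrock-runtime")`. Returning the routing metadata alongside the answer is what makes model switches and prompt changes observable downstream.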

4. Observability Is Built Into the Pipeline

Production systems track:

  • retrieval latency
  • token usage
  • pipeline failures
  • response quality signals

Using New Relic, teams can:

  • trace requests end-to-end
  • correlate issues across layers
  • monitor cost drivers

For example, a single request can be traced across:
retrieval → ranking → LLM inference
allowing teams to identify where degradation actually occurs.

Observability is not limited to infrastructure; it extends to retrieval behaviour and model interactions.

Reference: AWS X-Ray service map documentation
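The stage-level tracing described above can be sketched as follows. The stage names and the in-memory trace dict are illustrative; a real system would export these spans to New Relic or X-Ray rather than collect them locally:

```python
import time
from contextlib import contextmanager

@contextmanager
def span(trace: dict, stage: str):
    """Record the wall-clock duration of one pipeline stage under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace[stage] = time.perf_counter() - start

def handle_request(trace: dict) -> None:
    """Walk the request path, timing each stage separately."""
    with span(trace, "retrieval"):
        time.sleep(0.01)   # stand-in for vector search
    with span(trace, "ranking"):
        time.sleep(0.005)  # stand-in for re-ranking
    with span(trace, "llm"):
        time.sleep(0.02)   # stand-in for model inference
    # Total latency is the sum of stage spans, so a regression is
    # attributable to a specific stage rather than "the system is slow".
    trace["total"] = sum(trace.values())
```

The point of the structure is attribution: when p95 latency moves, the per-stage spans say whether retrieval, ranking, or inference moved it.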

A Practical Reference Flow

  1. Data stored in S3
  2. Processed via Lambda / Step Functions
  3. Embedded and stored in OpenSearch
  4. Query handled via API layer (often on EKS)
  5. Retrieval + ranking
  6. Context sent to Bedrock
  7. Observability layer tracks full request lifecycle

A working RAG system is defined not by individual components, but by how these components interact under load.

Design Decisions That Actually Matter

  • Retrieval strategy > model choice
  • Data quality > prompt engineering
  • Observability > optimisation
  • Iteration speed > initial design

Final Thought

RAG systems do not fail suddenly.

They degrade.

Slightly worse retrieval.
Slightly higher latency.
Slightly lower relevance.

Until the system stops being useful.

And most teams don’t notice until users do.

The difference between a working demo and a production system is architecture.


If you're working on RAG systems in production, the challenges are usually architectural rather than model-related.

More on how we approach AWS-based AI platforms: https://www.bionconsulting.com/ai-platform-architecture