
RAG Architecture on AWS: What Actually Works in Production

Written by Bion DevOps Team | Apr 24, 2026 1:17:22 PM

Most RAG systems don’t fail because of the model.

They fail because the surrounding system is not designed for production.

Early implementations usually follow a simple pattern:

  • documents embedded
  • stored in a vector database
  • retrieved and passed to an LLM

This works in controlled environments.

In production, the same setup starts to degrade:

  • retrieval quality drops
  • latency becomes unpredictable
  • responses lose consistency
  • debugging becomes difficult

RAG is not a model problem. It is a distributed systems problem.

Where Production RAG Systems Actually Break
Retrieval Quality Degrades Over Time

Initial indexing works well.

Over time:

  • new data is added inconsistently
  • embeddings are not refreshed
  • metadata becomes unreliable

Result: irrelevant context → degraded outputs.

Latency Is a System-Level Issue

Latency builds across:

  • vector search
  • orchestration
  • LLM inference

Even small delays accumulate.

Using tools like AWS X-Ray, teams can visualise how latency compounds across services — something basic metrics alone cannot explain.

No Visibility Into Retrieval → Output

Most systems cannot answer:

  • which documents were retrieved
  • why they were selected
  • how they affected the final answer

Without this, debugging hallucinations is guesswork.

Cost Scales Quietly

Costs grow across:

  • embeddings
  • storage
  • token usage

Without visibility, optimisation happens too late.

What Actually Works on AWS (In Practice)

1. Data Pipelines Are Treated as Systems

Production setups include:

  • versioned ingestion
  • consistent chunking
  • enforced metadata schemas

Typical pattern:

  • Amazon S3 as source of truth
  • AWS Lambda for processing
  • AWS Step Functions for orchestration

In production systems, ingestion and chunking strategies are explicitly defined and versioned, rather than handled ad hoc.
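As a sketch, the chunking stage of such a pipeline might look like the following. The chunk sizes, schema version tag, and metadata field names are illustrative assumptions, not a prescribed configuration:

```python
import hashlib

# Illustrative chunking parameters -- tune per corpus.
CHUNK_SIZE = 1000      # characters per chunk
CHUNK_OVERLAP = 200    # overlap between consecutive chunks
SCHEMA_VERSION = "v2"  # bump when chunking or metadata rules change

def chunk_document(text: str, source_key: str) -> list[dict]:
    """Split a document into overlapping chunks with an enforced metadata schema."""
    chunks = []
    step = CHUNK_SIZE - CHUNK_OVERLAP
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        body = text[start:start + CHUNK_SIZE]
        if not body:
            break
        chunks.append({
            "chunk_id": f"{source_key}#{i}",
            "text": body,
            # Enforced metadata: every chunk carries provenance and version,
            # so stale or inconsistently chunked data can be found and re-embedded.
            "metadata": {
                "source_key": source_key,
                "schema_version": SCHEMA_VERSION,
                "content_hash": hashlib.sha256(body.encode()).hexdigest(),
            },
        })
    return chunks
```

Because the schema version travels with every chunk, a change to the chunking strategy becomes a queryable fact rather than silent drift.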

2. Retrieval Is Layered

Working systems include:

  • hybrid search
  • metadata filtering
  • re-ranking

Common backend:

  • Amazon OpenSearch Service (vector search)
  • Amazon Aurora PostgreSQL with pgvector


Figure: Layered retrieval pipeline used in production RAG systems.

Retrieval is not static; it is continuously tuned based on query behaviour and evaluation metrics.

Reference: AWS OpenSearch vector search documentation 
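As an illustration, the layers above might be combined roughly as follows. The index fields, filter keys, and the toy re-ranking heuristic are assumptions for the sketch, not production values:

```python
# A minimal sketch of layered retrieval: an OpenSearch query body that
# combines BM25 keyword match, k-NN vector search, and a metadata filter,
# followed by a cheap re-ranking pass over the candidates.

def build_hybrid_query(query_text: str, query_vector: list[float],
                       doc_type: str, k: int = 20) -> dict:
    """Build a hybrid query: lexical + semantic scoring, hard metadata filter."""
    return {
        "size": k,
        "query": {
            "bool": {
                "filter": [{"term": {"metadata.doc_type": doc_type}}],
                "should": [
                    {"match": {"text": {"query": query_text}}},      # lexical
                    {"knn": {"embedding": {"vector": query_vector,   # semantic
                                           "k": k}}},
                ],
            }
        },
    }

def rerank(candidates: list[dict], query_terms: set[str], top_n: int = 5) -> list[dict]:
    """Toy re-ranker: boost candidates with more query-term overlap."""
    def overlap(c: dict) -> int:
        return len(query_terms & set(c["text"].lower().split()))
    return sorted(candidates, key=overlap, reverse=True)[:top_n]
```

In a real system the re-ranking step would typically be a cross-encoder or a managed re-ranking model rather than term overlap, but the shape is the same: retrieve broadly, then narrow with a second scoring pass.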

3. LLM Layer Is Abstracted

In production systems, the LLM layer is not a single model call.

Instead of locking into one model:

  • model switching is possible
  • prompts are versioned
  • fallback logic exists

Often implemented via Amazon Bedrock.

Figure: LLM routing and abstraction layer with model selection and fallback logic.

The LLM layer is treated as a replaceable component, not a fixed dependency.

Reference: Amazon Bedrock monitoring documentation
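A minimal sketch of this abstraction, assuming Bedrock's Converse API; the model IDs and the prompt-version tag are illustrative:

```python
# Model routing with fallback over the Bedrock Converse API.
# The preference list and prompt version id are illustrative assumptions.

PROMPT_VERSION = "answer-with-context/v3"  # versioned prompt template id

MODEL_PREFERENCE = [
    "anthropic.claude-3-5-sonnet-20240620-v1:0",  # primary
    "anthropic.claude-3-haiku-20240307-v1:0",     # fallback
]

def invoke_with_fallback(client, prompt: str) -> dict:
    """Try each model in preference order; return answer plus routing metadata."""
    last_error = None
    for model_id in MODEL_PREFERENCE:
        try:
            response = client.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
            )
            return {
                "text": response["output"]["message"]["content"][0]["text"],
                "model_id": model_id,            # which model actually answered
                "prompt_version": PROMPT_VERSION,
            }
        except Exception as exc:  # e.g. throttling or model unavailability
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```

In practice `client` would be `boto3.client("bedrock-runtime")`. Returning the routing metadata alongside the answer is what makes model switches and prompt changes observable downstream.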

4. Observability Is Built Into the Pipeline

Production systems track:

  • retrieval latency
  • token usage
  • pipeline failures
  • response quality signals

Using New Relic, teams can:

  • trace requests end-to-end
  • correlate issues across layers
  • monitor cost drivers

For example, a single request can be traced across:
retrieval → ranking → LLM inference
allowing teams to identify where degradation actually occurs.

Observability is not limited to infrastructure; it extends to retrieval behaviour and model interactions.

Reference: AWS X-Ray service map documentation
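The stage-level tracing described above can be sketched as follows. The stage names and the in-memory trace dict are illustrative; a real system would export these spans to New Relic or X-Ray rather than collect them locally:

```python
import time
from contextlib import contextmanager

@contextmanager
def span(trace: dict, stage: str):
    """Record the wall-clock duration of one pipeline stage under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace[stage] = time.perf_counter() - start

def handle_request(trace: dict) -> None:
    """Walk the request path, timing each stage separately."""
    with span(trace, "retrieval"):
        time.sleep(0.01)   # stand-in for vector search
    with span(trace, "ranking"):
        time.sleep(0.005)  # stand-in for re-ranking
    with span(trace, "llm"):
        time.sleep(0.02)   # stand-in for model inference
    # Total latency is the sum of stage spans, so a regression is
    # attributable to a specific stage rather than "the system is slow".
    trace["total"] = sum(trace.values())
```

The point of the structure is attribution: when p95 latency moves, the per-stage spans say whether retrieval, ranking, or inference moved it.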

A Practical Reference Flow

  1. Data stored in S3
  2. Processed via Lambda / Step Functions
  3. Embedded and stored in OpenSearch
  4. Query handled via API layer (often on EKS)
  5. Retrieval + ranking
  6. Context sent to Bedrock
  7. Observability layer tracks full request lifecycle

A working RAG system is defined not by individual components, but by how these components interact under load.

Design Decisions That Actually Matter

  • Retrieval strategy > model choice
  • Data quality > prompt engineering
  • Observability > optimisation
  • Iteration speed > initial design

Final Thought

RAG systems do not fail suddenly.

They degrade.

Slightly worse retrieval.
Slightly higher latency.
Slightly lower relevance.

Until the system stops being useful.

And most teams don’t notice until users do.

The difference between a working demo and a production system is architecture.


If you're working on RAG systems in production, the challenges are usually architectural rather than model-related.

More on how we approach AWS-based AI platforms: https://www.bionconsulting.com/ai-platform-architecture