Most RAG systems don’t fail because of the model.
They fail because the surrounding system is not designed for production.
Early implementations usually follow a simple pattern:
- documents embedded
- stored in a vector database
- retrieved and passed to an LLM
This works in controlled environments.
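That baseline loop can be sketched in a few lines. This is a minimal sketch: the embed function, vector store, and generate function are hypothetical stand-ins for whatever embedding service, vector database, and LLM the system actually uses.

```python
# Minimal sketch of the naive RAG pattern: embed -> retrieve -> generate.
# embed(), vector_store, and generate() are hypothetical stand-ins.

def naive_rag_answer(query, vector_store, embed, generate, k=4):
    """Retrieve the top-k chunks for a query and pass them to the LLM."""
    query_vec = embed(query)
    # Nearest-neighbour search over stored chunk embeddings.
    hits = vector_store.search(query_vec, k=k)
    context = "\n\n".join(hit["text"] for hit in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Nothing here is wrong as a prototype; the problems appear when this loop meets real data and real traffic.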
In production, the same setup starts to degrade:
- retrieval quality drops
- latency becomes unpredictable
- responses lose consistency
- debugging becomes difficult
RAG is not a model problem. It is a distributed systems problem.
Where Production RAG Systems Actually Break
Retrieval Quality Degrades Over Time
Initial indexing works well.
Over time:
- new data is added inconsistently
- embeddings are not refreshed
- metadata becomes unreliable
Result: irrelevant context → degraded outputs.
Latency Is a System-Level Issue
Latency builds across:
- vector search
- orchestration
- LLM inference
Even small delays accumulate.
Using tools like AWS X-Ray, teams can visualise how latency compounds across services — something basic metrics alone cannot explain.
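The underlying idea is simple to sketch: measure each stage separately so the accumulation is visible. In a real system these timings would be exported as trace segments (for example to AWS X-Ray) rather than kept in a dict; this stripped-down version just shows the shape.

```python
# Sketch: time each pipeline stage so per-stage latency, and how it
# accumulates, is visible. Real systems would export these as trace
# segments instead of holding them in memory.
import time

class StageTimer:
    def __init__(self):
        self.timings_ms = {}

    def timed(self, stage, fn, *args, **kwargs):
        """Run fn, recording its wall-clock time under the stage name."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.timings_ms[stage] = (time.perf_counter() - start) * 1000.0
        return result

    def total_ms(self):
        """End-to-end latency is the sum of every stage."""
        return sum(self.timings_ms.values())
```

Once every stage is measured, "the system is slow" becomes "ranking added 400 ms", which is an actionable statement.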
No Visibility Into Retrieval → Output
Most systems cannot answer:
- which documents were retrieved
- why they were selected
- how they affected the final answer
Without this, debugging hallucinations is guesswork.
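A lightweight fix is to emit one structured trace record per request, linking the query to the documents it retrieved and their selection scores. A sketch, with an illustrative record layout (not a fixed schema):

```python
# Sketch: record which documents were retrieved and why, so a bad answer
# can be traced back to the context it was built from.
import json
import uuid
from datetime import datetime, timezone

def build_retrieval_trace(query, hits):
    """Create a structured trace linking a query to its retrieved chunks."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved": [
            {
                "doc_id": h["doc_id"],
                "score": h["score"],        # why it was selected
                "source": h.get("source"),  # where it came from
            }
            for h in hits
        ],
    }

def emit(trace, sink=print):
    """Emit one JSON line per request, searchable later by trace_id."""
    sink(json.dumps(trace))
```

With this in place, debugging a hallucination starts from the actual context the model saw, not from guesswork.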
Cost Scales Quietly
Costs grow across:
- embeddings
- storage
- token usage
Without visibility, optimisation happens too late.
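Even a crude per-request cost estimate surfaces growth early. A sketch, with placeholder prices (the numbers below are illustrative, not real rates):

```python
# Sketch: estimate per-request spend from token counts.
# PRICE_PER_1K holds hypothetical USD rates per 1,000 tokens.
PRICE_PER_1K = {
    "embedding": 0.0001,
    "input": 0.003,
    "output": 0.015,
}

def request_cost(embedding_tokens, input_tokens, output_tokens,
                 prices=PRICE_PER_1K):
    """Estimate the cost of one RAG request from its token counts."""
    return (
        embedding_tokens / 1000 * prices["embedding"]
        + input_tokens / 1000 * prices["input"]
        + output_tokens / 1000 * prices["output"]
    )
```

Logged alongside the retrieval trace, this makes cost a per-request metric instead of a monthly surprise.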
What Actually Works on AWS (In Practice)
1. Data Pipelines Are Treated as Systems
Production setups include:
- versioned ingestion
- consistent chunking
- enforced metadata schemas
Typical pattern:
- Amazon S3 as source of truth
- AWS Lambda for processing
- AWS Step Functions for orchestration
In production systems, ingestion and chunking strategies are explicitly defined and versioned, rather than handled ad hoc.
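The core of that discipline fits in a small sketch: a chunker whose strategy carries a version stamp, and a validation step that rejects chunks missing required metadata. Field names here are illustrative, not a prescribed schema.

```python
# Sketch: versioned chunking with an enforced metadata schema.
CHUNKER_VERSION = "v3"  # bumped whenever the chunking strategy changes
REQUIRED_FIELDS = {"source_uri", "chunker_version", "chunk_index"}

def chunk_document(text, source_uri, size=500, overlap=50):
    """Split text into fixed-size overlapping chunks, each stamped with
    the metadata the index needs."""
    chunks = []
    step = size - overlap
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        chunks.append({
            "text": text[start:start + size],
            "metadata": {
                "source_uri": source_uri,
                "chunker_version": CHUNKER_VERSION,
                "chunk_index": i,
            },
        })
    return chunks

def validate(chunk):
    """Reject chunks that do not carry the required metadata."""
    missing = REQUIRED_FIELDS - chunk["metadata"].keys()
    if missing:
        raise ValueError(f"chunk missing metadata: {missing}")
    return chunk
```

Because every chunk records which chunker version produced it, a strategy change can be rolled out and old chunks re-processed deliberately rather than left to drift.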
2. Retrieval Is Layered
Working systems include:
- hybrid search
- metadata filtering
- re-ranking
Common backends:
- Amazon OpenSearch Service (vector search)
- Amazon Aurora PostgreSQL with pgvector

Figure: Layered retrieval pipeline used in production RAG systems.
Retrieval is not static; it is continuously tuned based on query behaviour and evaluation metrics.
Reference: AWS OpenSearch vector search documentation
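The layering itself is backend-agnostic. One common fusion step is reciprocal rank fusion (RRF), which merges a keyword ranking and a vector ranking without comparing their raw scores; the sketch below assumes the two input rankings come from separate BM25 and k-NN queries, and adds a metadata filter on top.

```python
# Sketch: layered retrieval = fuse keyword + vector rankings (RRF),
# then filter on metadata. Input rankings are lists of doc ids, best first.
def rrf_fuse(keyword_ranked, vector_ranked, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank + 1)."""
    scores = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def filter_by_metadata(doc_ids, metadata, **required):
    """Keep only docs whose metadata matches all required key/value pairs."""
    return [
        d for d in doc_ids
        if all(metadata.get(d, {}).get(key) == val
               for key, val in required.items())
    ]
```

A re-ranking model would then reorder the surviving candidates before they reach the LLM.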
3. LLM Layer Is Abstracted
In production systems, the LLM layer is not a single model call.
Instead of locking into one model:
- model switching is possible
- prompts are versioned
- fallback logic exists
Often implemented via Amazon Bedrock.

Figure: LLM routing and abstraction layer with model selection and fallback logic.
The LLM layer is treated as a replaceable component, not a fixed dependency.
Reference: Amazon Bedrock monitoring documentation
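The abstraction can be sketched as a router: prompts are looked up by versioned id, and models are tried in order until one succeeds. The model callables are injected here so the layer stays swappable; in a real system they would wrap Bedrock invocations.

```python
# Sketch: LLM layer with versioned prompts and model fallback.
# Model callables are injected; names and prompt ids are illustrative.
PROMPTS = {"answer@v2": "Context:\n{context}\n\nQuestion: {question}"}

def answer_with_fallback(context, question, models, prompt_id="answer@v2"):
    """Try each (name, callable) model in order; return the first answer."""
    prompt = PROMPTS[prompt_id].format(context=context, question=question)
    errors = []
    for name, call in models:
        try:
            return {"model": name, "prompt_id": prompt_id, "text": call(prompt)}
        except Exception as exc:  # real code would narrow this to throttling/timeouts
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all models failed: {errors}")
```

Recording which model and which prompt version produced each answer is what makes later comparisons, and migrations, tractable.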
4. Observability Is Built Into the Pipeline
Production systems track:
- retrieval latency
- token usage
- pipeline failures
- response quality signals
Using New Relic, teams can:
- trace requests end-to-end
- correlate issues across layers
- monitor cost drivers
For example, a single request can be traced across retrieval, ranking, and LLM response time, allowing teams to identify where degradation actually occurs.
Observability is not limited to infrastructure; it extends to retrieval behaviour and model interactions.
Reference: AWS X-Ray service map documentation
A Practical Reference Flow
- Data stored in S3
- Processed via Lambda / Step Functions
- Embedded and stored in OpenSearch
- Query handled via API layer (often on EKS)
- Retrieval + ranking
- Context sent to Bedrock
- Observability layer tracks full request lifecycle
A working RAG system is defined not by individual components, but by how these components interact under load.
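The request path through that flow can be sketched as a single function with every dependency injected. The retriever, reranker, LLM, and tracer names are illustrative; the point is that each stage is a replaceable, individually traced component.

```python
# Sketch of the reference flow as one traced request path.
# retriever, reranker, llm, and tracer are injected stand-ins.
def handle_query(query, *, retriever, reranker, llm, tracer):
    """One request: retrieve -> rank -> generate, traced end to end."""
    with tracer("retrieval"):
        candidates = retriever(query)
    with tracer("ranking"):
        ranked = reranker(query, candidates)
    context = "\n\n".join(ranked[:4])
    with tracer("generation"):
        return llm(context, query)
```

Swapping the vector store, the re-ranker, or the model then changes one injected dependency, not the request path.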
Design Decisions That Actually Matter
- Retrieval strategy > model choice
- Data quality > prompt engineering
- Observability > optimisation
- Iteration speed > initial design
Final Thought
RAG systems do not fail suddenly.
They degrade.
Slightly worse retrieval.
Slightly higher latency.
Slightly lower relevance.
Until the system stops being useful.
And most teams don’t notice until users do.
The difference between a working demo and a production system is architecture.
If you're working on RAG systems in production, the challenges are usually architectural rather than model-related.
More on how we approach AWS-based AI platforms: https://www.bionconsulting.com/ai-platform-architecture