🎯 Design RAG Architecture
1️⃣ Core Framework
When discussing RAG Architecture, I frame it as:
- Data ingestion and preprocessing
- Embedding and indexing
- Retrieval layer
- Context construction
- LLM generation
- Post-processing and validation
- Evaluation and feedback loop
- Trade-offs: recall vs latency vs cost
2️⃣ What Is RAG?
RAG = Retrieval-Augmented Generation
Retrieve knowledge
→ Add to prompt
→ LLM generates answer
Why RAG?
LLMs alone have limitations:
- No access to private data
- Knowledge may be outdated
- Hallucination risk
- Cannot fit a large corpus into the context window
RAG solves this by grounding answers in external data.
👉 Interview Answer
RAG is a system design pattern that combines retrieval with generation.
Instead of relying only on the model’s internal knowledge, the system retrieves relevant documents at runtime and injects them into the prompt, improving factual accuracy and enabling access to private or fresh data.
3️⃣ High-Level Architecture
┌───────────────┐
│   Raw Data    │
└──────┬────────┘
       ↓
┌──────────────────────────┐
│  Ingestion & Processing  │
└──────────┬───────────────┘
           ↓
┌──────────────────────────┐
│   Embedding + Indexing   │
└──────────┬───────────────┘
           ↓
┌──────────────────────────┐
│     Vector Database      │
└──────────┬───────────────┘
User Query → Query Embedding → Retrieval → Context Builder → LLM → Answer
👉 Interview Answer
A RAG system has two main pipelines:
- Offline pipeline for ingestion and indexing
- Online pipeline for query-time retrieval and generation
This separation keeps the expensive embedding and indexing work off the query path, so query-time retrieval stays fast even over large datasets.
4️⃣ Data Ingestion Pipeline (Offline)
Steps
Raw documents
→ Cleaning
→ Chunking
→ Metadata extraction
→ Embedding
→ Indexing into vector DB
Chunking
Split documents into smaller pieces:
Large doc → chunks (commonly 200–500 tokens each; tune per corpus)
Why Chunk?
- Fits into context window
- Improves retrieval granularity
- Reduces noise
Metadata
Example:
{
  "doc_id": "doc123",
  "section": "refund policy",
  "source": "help_center",
  "created_at": "2026-01-01"
}
👉 Interview Answer
The ingestion pipeline prepares data for retrieval.
Documents are cleaned, split into chunks, enriched with metadata, converted into embeddings, and stored in a vector database.
Chunking is critical because retrieval operates at the chunk level, not the full document level.
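The chunking and metadata steps above can be sketched roughly as follows. This is a minimal illustration, not a production splitter: it counts whitespace-separated words as a stand-in for tokens, and the overlap between neighboring chunks, the `chunk_text` and `build_records` names, and the `chunk_id` format are all assumptions added for the example.

```python
from typing import Dict, List


def chunk_text(text: str, max_tokens: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks, counting whitespace words as tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the document
    return chunks


def build_records(doc_id: str, text: str, metadata: Dict[str, str]) -> List[Dict]:
    """Attach the document metadata and a chunk id to every chunk."""
    return [
        {
            "chunk_id": f"{doc_id}-{i}",  # hypothetical id scheme
            "text": chunk,
            "metadata": {**metadata, "doc_id": doc_id},
        }
        for i, chunk in enumerate(chunk_text(text))
    ]
```

A real pipeline would count tokens with the tokenizer of the embedding model and would usually prefer splitting on sentence or section boundaries.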
5️⃣ Embedding and Indexing
Embedding
Convert text into vectors:
text → embedding vector
Vector Database
Stores:
embedding + metadata + text chunk
Index Types
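- Flat index (exact search; compares the query against every vector)
- Approximate nearest neighbor (ANN) indexes, which trade a little recall for much lower latency
- HNSW (a graph-based ANN index, common for large-scale systems)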
Storage Example
{
  "embedding": [...],
  "text": "Refunds are processed within 5 days...",
  "metadata": {
    "doc_id": "123",
    "section": "refund"
  }
}
👉 Interview Answer
Embeddings transform text into vectors, enabling semantic search.
A vector database stores embeddings along with metadata, and supports nearest-neighbor search to find relevant content efficiently.
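To make the storage layout and the flat (exact) index concrete, here is a small sketch using only the standard library. The `toy_embed` function is a hypothetical hashed bag-of-words stand-in for a real embedding model, so it only captures word overlap, not meaning; `FlatIndex` is likewise an illustrative name, not a real vector database API.

```python
import hashlib
import math
from typing import Dict, List, Tuple


def toy_embed(text: str, dim: int = 256) -> List[float]:
    """Hashed bag-of-words vector, L2-normalized (stand-in for a real model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class FlatIndex:
    """Exact search: scores the query against every stored vector."""

    def __init__(self) -> None:
        self.records: List[Dict] = []

    def add(self, text: str, metadata: Dict[str, str]) -> None:
        # Each record keeps embedding + text chunk + metadata together.
        self.records.append(
            {"embedding": toy_embed(text), "text": text, "metadata": metadata}
        )

    def search(self, query: str, k: int = 3) -> List[Tuple[float, Dict]]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, r["embedding"])), r)
            for r in self.records
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

The flat index costs one comparison per stored chunk on every query, which is why larger corpora typically move to an ANN index such as HNSW.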
6️⃣ Retrieval Layer (Online)
Retrieval Flow
User query
→ Convert to embedding
→ Search vector DB
→ Retrieve top-K chunks
Retrieval Types
Semantic Search
Vector similarity.
Keyword Search
BM25 / inverted index.
Hybrid Search
Combine both:
score = alpha * semantic + beta * keyword
Re-ranking
After retrieval:
Top 50 → re-rank → Top 5
👉 Interview Answer
Retrieval finds relevant chunks using vector similarity.
In production systems, hybrid retrieval is often used, combining semantic search and keyword search.
Re-ranking improves precision by selecting the most relevant results.
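The hybrid formula and the "Top 50 → re-rank → Top 5" step might look like the sketch below. Min-max normalization is an assumption added here, since raw BM25 scores and cosine similarities live on different scales; the default weights and the `rerank_fn` callable (for example a cross-encoder score) are also hypothetical.

```python
from typing import Callable, Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    semantic: Dict[str, float],
    keyword: Dict[str, float],
    alpha: float = 0.7,
    beta: float = 0.3,
) -> Dict[str, float]:
    """score = alpha * semantic + beta * keyword, on normalized scores."""
    sem, kw = min_max(semantic), min_max(keyword)
    return {
        doc: alpha * sem.get(doc, 0.0) + beta * kw.get(doc, 0.0)
        for doc in set(sem) | set(kw)
    }


def retrieve_then_rerank(
    fused: Dict[str, float],
    rerank_fn: Callable[[str], float],
    candidates: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Take the top candidates by fused score, then reorder them with rerank_fn."""
    shortlist = sorted(fused, key=fused.get, reverse=True)[:candidates]
    return sorted(shortlist, key=rerank_fn, reverse=True)[:final_k]
```

A document missing from one retriever simply contributes 0 for that term, so documents found by both retrievers tend to rise to the top.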
7️⃣ Context Construction
Problem
We cannot pass every retrieved document to the model: the context window is limited, and extra text adds noise and cost.
Solution
Select and build context:
Top-K chunks
→ Deduplicate
→ Rank
→ Truncate
→ Format into prompt
Prompt Example
You are a helpful assistant.
Context:
[Doc1]
[Doc2]
[Doc3]
Question:
...
Answer based only on context.
Techniques
- Max token budgeting
- Context compression
- Summarization
- Deduplication
- Section prioritization
👉 Interview Answer
Context construction is critical.
Even if retrieval finds relevant documents, the system must carefully select and format them within the context window.
Poor context construction leads to hallucination or irrelevant answers.
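A rough sketch of the deduplicate, rank, truncate, and format steps follows. It assumes each chunk is a dict with `text` and `score` keys and uses word count as a crude token estimate; both are simplifications for illustration, and the function names are made up for this example.

```python
from typing import Dict, List


def build_context(chunks: List[Dict], max_tokens: int = 1000) -> List[Dict]:
    """Deduplicate, rank by score, and keep chunks that fit the token budget."""
    seen = set()
    selected: List[Dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        key = " ".join(chunk["text"].lower().split())  # ignore case and spacing
        if key in seen:
            continue  # exact duplicate of a higher-scored chunk
        cost = len(chunk["text"].split())  # word count as a rough token estimate
        if used + cost > max_tokens:
            continue  # does not fit; a smaller chunk further down may still fit
        seen.add(key)
        selected.append(chunk)
        used += cost
    return selected


def format_prompt(question: str, chunks: List[Dict]) -> str:
    """Lay the chunks out in the prompt format shown above."""
    context = "\n".join(
        f"[Doc{i}] {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant.\n"
        f"Context:\n{context}\n"
        f"Question:\n{question}\n"
        "Answer based only on context."
    )
```

Skipping oversized chunks instead of stopping at the first one that overflows is a design choice; stopping early keeps strict score order at the cost of leaving budget unused.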
8️⃣ LLM Generation
Inputs
- System instructions
- User query
- Retrieved context
Output
- Natural language answer
- Structured output (JSON, tables)
- Tool calls (optional)
Prompt Constraints
Only use provided context
If not found, say "I don't know"
👉 Interview Answer
The LLM generates answers using retrieved context.
Proper prompting is required to ensure the model uses only the provided documents and avoids hallucinating unsupported facts.
9️⃣ Post-processing
Tasks
- Validate output format
- Add citations
- Filter unsafe content
- Normalize response
- Retry if needed
Example
Answer + [source1, source2]
👉 Interview Answer
After generation, the system should validate the output, attach citations, enforce safety policies, and retry if the response is invalid or incomplete.
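One way to sketch the validate, cite, and retry loop is shown below. It assumes the model was asked to return JSON with `answer` and `sources` fields, which is a convention invented for this example, and `generate` is a hypothetical zero-argument callable wrapping the LLM call. Safety filtering is left out to keep the sketch short.

```python
import json
from typing import Callable, Dict, List, Optional


def validate(raw: str, allowed_sources: List[str]) -> Optional[Dict]:
    """Return the parsed answer if it is well-formed and cites only known sources."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None
    answer, sources = data.get("answer"), data.get("sources")
    if not isinstance(answer, str) or not answer.strip():
        return None  # missing or empty answer
    if not isinstance(sources, list) or not sources:
        return None  # every answer must carry at least one citation
    if any(s not in allowed_sources for s in sources):
        return None  # cites something that was never retrieved
    return {"answer": answer.strip(), "sources": sources}


def generate_with_retry(
    generate: Callable[[], str],
    allowed_sources: List[str],
    max_attempts: int = 3,
) -> Dict:
    """Call the model until the output validates, then fall back to a refusal."""
    for _ in range(max_attempts):
        result = validate(generate(), allowed_sources)
        if result is not None:
            return result
    return {"answer": "I don't know", "sources": []}
```

Checking citations against the list of retrieved sources catches one common failure: an answer that cites a document the model never saw.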
🔟 Evaluation
Offline Metrics
- Recall@K (retrieval quality)
- Precision
- Answer accuracy
- Faithfulness
- Context relevance
Online Metrics
- User satisfaction
- Click-through rate
- Answer acceptance
- Latency
- Cost
- Failure rate
👉 Interview Answer
RAG systems must evaluate both retrieval and generation.
Retrieval quality affects what the model sees, and generation quality affects final answers.
Both must be measured and improved continuously.
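For the retrieval side, Recall@K and Precision@K are simple to compute once a labeled set of relevant documents exists. The sketch below assumes document IDs are unique within a result list; the function names are illustrative.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k
```

For example, with `retrieved = ["d1", "d4", "d2"]` and `relevant = {"d1", "d2", "d3"}`, both metrics at K=3 equal 2/3: two of the three relevant documents were found, and two of the three results were relevant.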
1️⃣1️⃣ Advanced Techniques
Query Rewriting
User query → rewritten query → retrieval
Improves recall.
Multi-step Retrieval
Step1 retrieve → refine query → retrieve again
Multi-hop RAG
Used for complex reasoning across documents.
Fusion Retrieval
Combine multiple retrievers.
Self-Reflection
The LLM checks its own answer against the retrieved context before returning it.
👉 Interview Answer
Advanced RAG systems improve retrieval using query rewriting, multi-step retrieval, and re-ranking.
These techniques help handle ambiguous queries and complex multi-hop questions.
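Fusion retrieval needs a rule for merging ranked lists from different retrievers. The notes do not name one; Reciprocal Rank Fusion (RRF) is a common choice and is used here as an assumption. The constant `k = 60` is a conventional default, not something taken from these notes.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF uses only rank positions, so it avoids the score-normalization problem that weighted hybrid scoring has to deal with.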
1️⃣2️⃣ Scaling Challenges
Challenges
- Large document corpus
- High query QPS
- Long context
- Embedding cost
- Retrieval latency
- Multi-tenant isolation
Solutions
- Sharding vector DB
- Caching embeddings
- Pre-computing results for frequent queries
- Tiered storage
- Approximate search
- Parallel retrieval
👉 Interview Answer
At scale, RAG systems must optimize retrieval latency and cost.
Techniques include ANN indexing, caching, sharding, and limiting context size.
1️⃣3️⃣ Failure Modes
Common Issues
- Irrelevant retrieval
- Missing key document
- Hallucination
- Context overflow
- Duplicate chunks
- Outdated data
- Prompt injection via documents
Mitigations
- Better chunking
- Re-ranking
- Strict prompt constraints
- Source filtering
- Metadata filtering
- Validation layer
- Trust scoring
👉 Interview Answer
Most RAG failures come from poor retrieval or bad context.
Improving chunking, ranking, filtering, and prompt constraints is critical to system reliability.
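Source filtering and metadata filtering can be applied before chunks ever reach the prompt. The sketch below assumes the metadata fields from the ingestion example (`source`, `created_at` as an ISO date string); the allow-list and cutoff date are illustrative parameters, not recommended values.

```python
from datetime import date
from typing import Dict, List, Set


def filter_chunks(
    chunks: List[Dict],
    allowed_sources: Set[str],
    min_created_at: date,
) -> List[Dict]:
    """Drop chunks from untrusted sources and chunks older than the cutoff."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if meta.get("source") not in allowed_sources:
            continue  # source filtering: only trusted collections reach the prompt
        created = meta.get("created_at")
        if created is None or date.fromisoformat(created) < min_created_at:
            continue  # freshness filtering: skip undated or stale chunks
        kept.append(chunk)
    return kept
```

Restricting sources also narrows the surface for prompt injection via documents, although it does not replace strict prompt constraints and output validation.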
1️⃣4️⃣ Trade-offs
| Dimension | Trade-off |
|---|---|
| Recall vs Precision | More docs vs more relevant docs |
| Latency vs Quality | More retrieval steps vs faster response |
| Cost vs Accuracy | Larger context vs cheaper calls |
| Freshness vs Consistency | Real-time vs cached data |
👉 Interview Answer
RAG design involves balancing recall, precision, latency, and cost.
More retrieved documents can improve recall, but may introduce noise and increase latency.
1️⃣5️⃣ End-to-End Flow
Full Flow
User query
→ Query rewrite (optional)
→ Query embedding
→ Retrieve top-K documents
→ Re-rank
→ Build context
→ LLM generates answer
→ Validate output
→ Return response with citations
Key Insight
RAG is a pipeline, not a single step.
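The full flow can be wired together as a single function that takes each stage as a callable. This is only a structural sketch: `rewrite`, `retrieve`, `rerank`, and `generate` are hypothetical interfaces standing in for real components, and chunks are assumed to be dicts with `doc_id` and `text` keys.

```python
from typing import Callable, Dict, List


def answer_query(
    query: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str, int], List[Dict]],
    rerank: Callable[[str, List[Dict]], List[Dict]],
    generate: Callable[[str], str],
    top_k: int = 50,
    final_k: int = 5,
) -> Dict:
    """Run rewrite, retrieve, re-rank, build context, generate, validate in order."""
    rewritten = rewrite(query)                       # optional query rewrite
    candidates = retrieve(rewritten, top_k)          # top-K from the index
    ranked = rerank(rewritten, candidates)[:final_k]
    context = "\n".join(
        f"[Doc{i}] {c['text']}" for i, c in enumerate(ranked, start=1)
    )
    prompt = (
        f"Context:\n{context}\n"
        f"Question:\n{query}\n"
        "Answer based only on context."
    )
    answer = generate(prompt)
    if not answer.strip():                           # minimal output validation
        answer = "I don't know"
    return {"answer": answer, "citations": [c["doc_id"] for c in ranked]}
```

Passing the stages in as callables keeps each one replaceable and testable on its own, which matches the point that RAG is a pipeline of separable steps.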
🧠 Staff-Level Answer (Final)
👉 Interview Answer Full Version
When designing a RAG system, I think of it as a pipeline that combines retrieval and generation.
The system has an offline ingestion pipeline that processes documents, splits them into chunks, generates embeddings, and stores them in a vector database.
At query time, the system converts the user query into an embedding, retrieves the most relevant chunks, optionally re-ranks them, and constructs a context within the model’s token limit.
The LLM then generates an answer using this context.
A key challenge is ensuring that retrieved documents are both relevant and sufficient.
Poor retrieval leads to hallucination, even if the model is strong.
Therefore, techniques like hybrid search, re-ranking, query rewriting, and metadata filtering are important.
The system should also validate outputs, attach citations, and enforce safety constraints.
At scale, we must optimize for latency and cost using approximate search, caching, and limiting context size.
Ultimately, the goal of RAG is to ground LLM outputs in real data, improving accuracy, freshness, and trustworthiness.
⭐ Final Insight
The essence of RAG is not "adding a vector DB". It is a complete retrieval + context + generation pipeline, and retrieval quality sets the ceiling for the whole system.
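Quick Recap
🎯 RAG Architecture Design
1️⃣ Core Framework
When designing RAG, work through:
- Data ingestion
- Embedding and indexing
- Retrieval layer
- Context construction
- LLM generation
- Post-processing
- Evaluation
- Trade-offs: recall vs latency vs cost
2️⃣ What Is RAG?
RAG = Retrieval-Augmented Generation
Retrieve first → then generate
👉 Interview Answer
RAG is an architecture that combines retrieval with LLM generation. By retrieving relevant documents at runtime and injecting them into the prompt, it improves accuracy and supports private data.
3️⃣ Core Architecture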
Offline: data → embedding → vector DB
Online: query → retrieval → context → LLM → answer
4️⃣ Ingestion Pipeline
Data → cleaning → chunking → embedding → storage
Key point: chunking matters a great deal
5️⃣ Retrieval
query → embedding → vector search → top-K
6️⃣ Context Construction
Select top-K → deduplicate → truncate → assemble prompt
7️⃣ Generation
context + question → LLM → answer
8️⃣ Key Optimizations
- Hybrid retrieval
- Re-ranking
- Query rewrite
- Context compression
9️⃣ Common Problems
- Key documents are not retrieved
- Irrelevant content is retrieved
- Hallucination
- Context is too long
- Data is stale
🔟 Core Trade-offs
- Recall vs Precision
- Latency vs Quality
- Cost vs Context size
🧠 Interview Summary
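RAG is a retrieval + generation pipeline.
The offline side handles data processing; the online side handles query retrieval and answer generation.
Retrieval quality is the key to the system, because the model can only reason over the context it is given.
A good RAG system needs to optimize chunking, retrieval, ranking, and prompting, while balancing latency, cost, and accuracy.
⭐ One-Line Takeaway
The ceiling of RAG is set by retrieval, not by the model.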