System Design Deep Dive - 02 Design RAG Architecture

Posted by ailswan, May 25, 2026


🎯 Design RAG Architecture


1️⃣ Core Framework

When discussing RAG Architecture, I frame it as:

  1. Data ingestion and preprocessing
  2. Embedding and indexing
  3. Retrieval layer
  4. Context construction
  5. LLM generation
  6. Post-processing and validation
  7. Evaluation and feedback loop
  8. Trade-offs: recall vs latency vs cost

2️⃣ What Is RAG?

RAG = Retrieval-Augmented Generation

Retrieve knowledge
→ Add to prompt
→ LLM generates answer

Why RAG?

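LLMs alone have limitations: their knowledge is frozen at training time, so they cannot see private or fresh data and may state facts that are wrong or out of date.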

RAG solves this by grounding answers in external data.


👉 Interview Answer

RAG is a system design pattern that combines retrieval with generation.

Instead of relying only on the model’s internal knowledge, the system retrieves relevant documents at runtime and injects them into the prompt, improving factual accuracy and enabling access to private or fresh data.


3️⃣ High-Level Architecture


                ┌───────────────┐
                │  Raw Data     │
                └──────┬────────┘
                       ↓
        ┌──────────────────────────┐
        │ Ingestion & Processing   │
        └──────────┬───────────────┘
                   ↓
        ┌──────────────────────────┐
        │ Embedding + Indexing     │
        └──────────┬───────────────┘
                   ↓
        ┌──────────────────────────┐
        │ Vector Database          │
        └──────────┬───────────────┘

User Query → Query Embedding → Retrieval → Context Builder → LLM → Answer

👉 Interview Answer

A RAG system has two main pipelines:

  1. Offline pipeline for ingestion and indexing
  2. Online pipeline for query-time retrieval and generation

This separation keeps the expensive embedding and indexing work off the query path, so queries stay fast even over large datasets.


4️⃣ Data Ingestion Pipeline (Offline)


Steps

Raw documents
→ Cleaning
→ Chunking
→ Metadata extraction
→ Embedding
→ Indexing into vector DB

Chunking

Split documents into smaller pieces:

Large doc → chunks (200–500 tokens each)
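
A minimal sketch of the chunking step, under two assumptions that are not in the original: whitespace-separated words stand in for model tokens, and consecutive chunks overlap slightly so a sentence cut at a boundary still appears whole in one chunk. The name `chunk_text` and its parameters are illustrative only.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of about chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    # Assumption: whitespace words approximate tokens. A real system
    # would count tokens with the embedding model's own tokenizer.
    tokens = text.split()
    step = chunk_size - overlap

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each chunk falls inside the 200–500 token range above. The overlap size is a tuning knob, not something the pipeline prescribes.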

Why Chunk?


Metadata

Example:

{
  "doc_id": "doc123",
  "section": "refund policy",
  "source": "help_center",
  "created_at": "2026-01-01"
}

👉 Interview Answer

The ingestion pipeline prepares data for retrieval.

Documents are cleaned, split into chunks, enriched with metadata, converted into embeddings, and stored in a vector database.

Chunking is critical because retrieval operates at the chunk level, not the full document level.


5️⃣ Embedding and Indexing


Embedding

Convert text into vectors:

text → embedding vector

Vector Database

Stores:

embedding + metadata + text chunk

Index Types


Storage Example

{
  "embedding": [...],
  "text": "Refunds are processed within 5 days...",
  "metadata": {
    "doc_id": "123",
    "section": "refund"
  }
}

👉 Interview Answer

Embeddings transform text into vectors, enabling semantic search.

A vector database stores embeddings along with metadata, and supports nearest-neighbor search to find relevant content efficiently.
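
To make the storage layout and nearest-neighbor search concrete, here is a small in-memory sketch. It is not how a real vector database works internally: it scans every record (exact search) instead of using an approximate index, and `Record`, `InMemoryVectorStore`, and the `where` metadata filter are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """One stored item: embedding + text chunk + metadata."""
    embedding: list[float]
    text: str
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Brute-force store: fine for a demo, too slow for large corpora."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, record: Record) -> None:
        self.records.append(record)

    def search(self, query: list[float], k: int = 3,
               where: Optional[dict] = None) -> list[tuple[float, Record]]:
        # Optional metadata filter: keep records whose metadata matches
        # every key/value pair in `where`.
        candidates = [
            r for r in self.records
            if not where or all(r.metadata.get(f) == v for f, v in where.items())
        ]
        scored = [(cosine(query, r.embedding), r) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A production system would swap the linear scan for an approximate nearest-neighbor index, trading a little recall for much lower latency.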


6️⃣ Retrieval Layer (Online)


Retrieval Flow

User query
→ Convert to embedding
→ Search vector DB
→ Retrieve top-K chunks
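
The retrieval flow above can be sketched end to end with a toy embedding. The big assumption here is `toy_embed`: it uses sparse word counts as a stand-in for a real embedding model, so it only matches exact words and has no semantic understanding. The function names are illustrative.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Assumption: word counts stand in for a dense embedding vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Embed the query, score every chunk, return the k best (score, chunk) pairs."""
    query_vec = toy_embed(query)
    # In a real system the chunk embeddings are precomputed offline;
    # they are recomputed here only to keep the sketch self-contained.
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```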

Retrieval Types

Semantic search: vector similarity.


Keyword search: BM25 / inverted index.


Hybrid search: combine both:

score = alpha * semantic + beta * keyword
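
A sketch of the weighted combination above. One detail the formula leaves open is that semantic scores (for example cosine similarity) and keyword scores (for example BM25) live on different scales, so this sketch min-max normalizes each set before mixing. The normalization step and the default weights are assumptions, not part of the original formula.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every item gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(semantic: dict[str, float], keyword: dict[str, float],
                  alpha: float = 0.7, beta: float = 0.3) -> dict[str, float]:
    """score = alpha * semantic + beta * keyword, on normalized scores."""
    sem = min_max(semantic)
    kw = min_max(keyword)
    # A document missing from one retriever contributes 0.0 for that part.
    doc_ids = set(sem) | set(kw)
    return {d: alpha * sem.get(d, 0.0) + beta * kw.get(d, 0.0) for d in doc_ids}
```

The weights alpha and beta are usually tuned on an evaluation set rather than fixed up front.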

Re-ranking

After retrieval:

Top 50 → re-rank → Top 5
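
The two-stage shape can be sketched as below. The scorer is passed in as a function because the original does not say which re-ranking model is used. `overlap_score` is a deliberately crude, hypothetical stand-in for a stronger relevance model such as a cross-encoder.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-score a wide candidate set and keep only the top_n best."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    # Hypothetical scorer: fraction of query words that appear in the chunk.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0
```

The first stage is cheap and recall-oriented (Top 50), and the second stage is more expensive per candidate but only runs on that short list.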

👉 Interview Answer

Retrieval finds relevant chunks using vector similarity.

In production systems, hybrid retrieval is often used, combining semantic search and keyword search.

Re-ranking improves precision by re-scoring the retrieved candidates and keeping only the most relevant ones.


7️⃣ Context Construction


Problem

We cannot pass all retrieved documents to the LLM, because the context window is limited.


Solution

Select and build context:

Top-K chunks
→ Deduplicate
→ Rank
→ Truncate
→ Format into prompt
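
The steps above can be sketched as one function. Assumptions: chunks arrive as (score, text) pairs, duplicates are detected by exact match after lowercasing and whitespace normalization, and the token budget is measured in whitespace words. The `[DocN]` labels follow the prompt example in this section; `build_context` itself is a hypothetical name.

```python
def build_context(chunks: list[tuple[float, str]], max_tokens: int = 1000) -> str:
    """Deduplicate, rank, truncate to a token budget, and format as [DocN] lines."""
    seen = set()
    selected = []
    used = 0
    # Rank: highest score first (sort is stable, so ties keep input order).
    for _score, text in sorted(chunks, key=lambda pair: pair[0], reverse=True):
        # Deduplicate on a normalized form of the text.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        # Truncate: stop at the first chunk that would exceed the budget.
        cost = len(text.split())  # assumption: words approximate tokens
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return "\n".join(f"[Doc{i}] {text}" for i, text in enumerate(selected, start=1))
```

Stopping at the first chunk that does not fit keeps the ranking order intact. Skipping it and trying smaller, lower-ranked chunks is an equally valid design choice.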

Prompt Example

You are a helpful assistant.

Context:
[Doc1]
[Doc2]
[Doc3]

Question:
...

Answer based only on context.

Techniques


👉 Interview Answer

Context construction is critical.

Even if retrieval finds relevant documents, the system must carefully select and format them within the context window.

Poor context construction leads to hallucination or irrelevant answers.


8️⃣ LLM Generation


Inputs


Output


Prompt Constraints

Only use provided context
If not found, say "I don't know"
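
A small sketch of how these constraints could be baked into the prompt template. The wording mirrors the prompt example from the context construction section; `build_prompt` is an illustrative name, and the exact phrasing that works best depends on the model.

```python
def build_prompt(question: str, docs: list[str]) -> str:
    """Assemble a grounded prompt: constraints, labeled context, then the question."""
    context = "\n".join(f"[Doc{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "You are a helpful assistant.\n\n"
        "Only use the provided context.\n"
        "If the answer is not in the context, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer based only on context."
    )
```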

👉 Interview Answer

The LLM generates answers using retrieved context.

Proper prompting is required to ensure the model uses only the provided documents and avoids hallucinating unsupported facts.


9️⃣ Post-processing


Tasks


Example

Answer + [source1, source2]

👉 Interview Answer

After generation, the system should validate the output, attach citations, enforce safety policies, and retry if the response is invalid or incomplete.
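
A hedged sketch of the validation and citation step. It assumes the model was asked to cite with `[DocN]` markers, as in the prompt example earlier, and that `sources` maps each N to a source identifier. Both the marker convention and the `postprocess` function are assumptions for illustration. Safety checks and the retry loop itself are left to the caller.

```python
import re


def extract_citations(answer: str) -> list[int]:
    """Return the sorted, de-duplicated doc numbers cited as [DocN]."""
    return sorted({int(n) for n in re.findall(r"\[Doc(\d+)\]", answer)})


def postprocess(answer: str, sources: dict[int, str]) -> dict:
    """Attach sources and flag answers that should be retried."""
    cited = extract_citations(answer)
    unknown = [n for n in cited if n not in sources]
    is_refusal = answer.strip() == "I don't know"
    # Valid if the model refused, or cited at least one doc and only known docs.
    valid = is_refusal or (bool(cited) and not unknown)
    return {
        "answer": answer,
        "sources": [sources[n] for n in cited if n in sources],
        "valid": valid,
    }
```

When `valid` is false, the caller can retry generation or fall back to a refusal instead of returning an unsupported answer.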


🔟 Evaluation


Offline Metrics


Online Metrics


👉 Interview Answer

RAG systems must evaluate both retrieval and generation.

Retrieval quality affects what the model sees, and generation quality affects final answers.

Both must be measured and improved continuously.
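
On the retrieval side, two commonly used offline metrics are recall@k and reciprocal rank. The original does not name specific metrics, so treat these as illustrative choices rather than the author's list. Both assume a labeled set of relevant document IDs per query.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging `reciprocal_rank` over a query set gives MRR. Generation quality needs separate checks, such as whether each answer is supported by the retrieved context.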


1️⃣1️⃣ Advanced Techniques


Query Rewriting

User query → rewritten query → retrieval

Improves recall.


Multi-step Retrieval

Step 1: retrieve → refine query → retrieve again
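
A minimal sketch of that loop. Both `retrieve_fn` and `refine_fn` are hypothetical callables supplied by the caller. In practice the refine step is often an LLM call that reads the chunks found so far and proposes a follow-up query.

```python
from typing import Callable


def multi_step_retrieve(query: str,
                        retrieve_fn: Callable[[str], list[str]],
                        refine_fn: Callable[[str, list[str]], str],
                        steps: int = 2) -> list[str]:
    """Retrieve, refine the query from what was found, and retrieve again."""
    collected: list[str] = []
    current = query
    for step in range(steps):
        for chunk in retrieve_fn(current):
            if chunk not in collected:  # keep first-seen order, drop repeats
                collected.append(chunk)
        if step < steps - 1:
            # Refine using the original question plus everything found so far.
            current = refine_fn(query, collected)
    return collected
```

Each extra step adds a retrieval round trip (and usually an LLM call), which is the latency cost of this technique.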

Multi-hop RAG

Used for complex reasoning across documents.


Fusion Retrieval

Combine multiple retrievers.
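
The original does not say how the retrievers' results are merged. One widely used option is reciprocal rank fusion (RRF), swapped in here as an assumption: each document earns 1 / (k + rank) from every ranked list it appears in, and the totals decide the final order. The constant k = 60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one list using RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because RRF only looks at ranks, it sidesteps the problem of vector and keyword scores living on different scales.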


Self-Reflection

LLM checks its own answer.


👉 Interview Answer

Advanced RAG systems improve retrieval using query rewriting, multi-step retrieval, and re-ranking.

These techniques help handle ambiguous queries and complex multi-hop questions.


1️⃣2️⃣ Scaling Challenges


Challenges


Solutions


👉 Interview Answer

At scale, RAG systems must optimize retrieval latency and cost.

Techniques include ANN indexing, caching, sharding, and limiting context size.


1️⃣3️⃣ Failure Modes


Common Issues


Mitigations


👉 Interview Answer

Most RAG failures come from poor retrieval or bad context.

Improving chunking, ranking, filtering, and prompt constraints is critical to system reliability.


1️⃣4️⃣ Trade-offs


Dimension                 | Trade-off
Recall vs Precision       | More docs vs more relevant docs
Latency vs Quality        | More retrieval steps vs faster response
Cost vs Accuracy          | Larger context vs cheaper calls
Freshness vs Consistency  | Real-time vs cached data

👉 Interview Answer

RAG design involves balancing recall, precision, latency, and cost.

More retrieved documents can improve recall, but may introduce noise and increase latency.


1️⃣5️⃣ End-to-End Flow


Full Flow

User query
→ Query rewrite (optional)
→ Query embedding
→ Retrieve top-K documents
→ Re-rank
→ Build context
→ LLM generates answer
→ Validate output
→ Return response with citations
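
The full flow can be wired together as a single function that takes each stage as a callable. Everything here is a sketch under assumptions: `retrieve`, `rerank`, `generate`, and `rewrite` are hypothetical placeholders for the real components, and the validation step is reduced to a check for an empty answer.

```python
from typing import Callable, Optional


def answer_query(query: str,
                 retrieve: Callable[[str], list[str]],
                 rerank: Callable[[str, list[str]], list[str]],
                 generate: Callable[[str], str],
                 rewrite: Optional[Callable[[str], str]] = None,
                 top_n: int = 5) -> dict:
    """Run rewrite -> retrieve -> re-rank -> build context -> generate -> validate."""
    search_query = rewrite(query) if rewrite else query   # optional query rewrite
    candidates = retrieve(search_query)                   # embedding + top-K search
    top_chunks = rerank(search_query, candidates)[:top_n]
    context = "\n".join(f"[Doc{i}] {c}" for i, c in enumerate(top_chunks, start=1))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer based only on context."
    )
    answer = generate(prompt).strip()
    if not answer:  # minimal validation: never return an empty answer
        answer = "I don't know"
    return {"answer": answer, "citations": top_chunks}
```

Passing the stages in as functions keeps each one swappable and testable on its own.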

Key Insight

RAG is a pipeline, not a single step.


🧠 Staff-Level Answer (Final)


👉 Interview Answer Full Version

When designing a RAG system, I think of it as a pipeline that combines retrieval and generation.

The system has an offline ingestion pipeline that processes documents, splits them into chunks, generates embeddings, and stores them in a vector database.

At query time, the system converts the user query into an embedding, retrieves the most relevant chunks, optionally re-ranks them, and constructs a context within the model’s token limit.

The LLM then generates an answer using this context.

A key challenge is ensuring that retrieved documents are both relevant and sufficient.

Poor retrieval leads to hallucination, even if the model is strong.

Therefore, techniques like hybrid search, re-ranking, query rewriting, and metadata filtering are important.

The system should also validate outputs, attach citations, and enforce safety constraints.

At scale, we must optimize for latency and cost using approximate search, caching, and limiting context size.

Ultimately, the goal of RAG is to ground LLM outputs in real data, improving accuracy, freshness, and trustworthiness.


⭐ Final Insight

The essence of RAG is not "adding a vector DB"; it is a complete retrieval + context + generation pipeline, in which retrieval quality sets the ceiling for the whole system.



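Summary (translated from the Chinese section)


🎯 RAG Architecture Design


1️⃣ Core Framework

When designing RAG, you can work through:

  1. Data ingestion
  2. Embedding and indexing
  3. Retrieval layer
  4. Context construction
  5. LLM generation
  6. Post-processing
  7. Evaluation
  8. Trade-offs: recall vs latency vs cost

2️⃣ What Is RAG?

RAG = Retrieval-Augmented Generation

Retrieve first → then generate

👉 Interview Answer

RAG is an architecture that combines retrieval with LLM generation. By retrieving relevant documents at runtime and injecting them into the prompt, it improves accuracy and supports private data.


3️⃣ Core Architecture


Offline: data → embedding → vector store
Online: query → retrieval → context → LLM → answer

4️⃣ Ingestion Pipeline


Data → cleaning → chunking → embedding → storage

Key point: chunking matters a lot.


5️⃣ Retrieval


query → embedding → vector search → top-K

6️⃣ Context Construction


Select top-K → deduplicate → truncate → assemble prompt

7️⃣ Generation


context + question → LLM → answer

8️⃣ Key Optimization Points



9️⃣ Common Issues



🔟 Core Trade-offs



🧠 Interview Summary


RAG is a retrieval + generation pipeline.

The offline side handles data processing; the online side handles query-time retrieval and answer generation.

The key to the system is retrieval quality, because the model can only reason over the context it is given.

A good RAG system needs to optimize chunking, retrieval, ranking, and the prompt, and to balance latency, cost, and accuracy.


⭐ Final One-Liner

The ceiling of RAG is set not by the model but by retrieval.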
