
System Design Deep Dive - 01 RAG Architecture Explained for Engineers

Posted by ailswan, May 24, 2026


🎯 RAG Architecture Explained for Engineers


1️⃣ Core Framework

When discussing RAG Architecture, I frame it as:

  1. Why RAG is needed
  2. Knowledge ingestion
  3. Chunking and embeddings
  4. Vector storage and indexing
  5. Query rewriting and retrieval
  6. Ranking and context building
  7. Generation with citations
  8. Trade-offs: accuracy vs latency vs cost

2️⃣ What Is RAG?

RAG means Retrieval-Augmented Generation.

It combines:

External Knowledge Retrieval
+ LLM Generation

Instead of asking the LLM to answer only from model memory, the system retrieves relevant knowledge at runtime and gives it to the model as context.


Basic Flow

User Question
→ Retrieve relevant documents
→ Add documents to prompt
→ LLM generates grounded answer

👉 Interview Answer

RAG is an architecture where the system retrieves relevant external knowledge at runtime and provides that knowledge to the LLM as context.

This helps the model answer questions using private, updated, or domain-specific information instead of relying only on model memory.


3️⃣ Why Do We Need RAG?


LLM Limitations

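LLMs may not know:

  • Private information, such as internal company policies
  • Recent or updated information
  • Domain-specific information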


Without RAG

User asks about internal policy
→ LLM guesses
→ Risk of hallucination

With RAG

User asks about internal policy
→ System retrieves policy document
→ LLM answers using retrieved context

👉 Interview Answer

RAG is useful because LLMs do not automatically know private or updated information.

By retrieving relevant documents at runtime, the system can ground the answer in real sources, reduce hallucination, and support enterprise knowledge use cases.


4️⃣ High-Level RAG Architecture


Architecture

Documents
→ Ingestion Pipeline
→ Chunking
→ Embedding Model
→ Vector Database
→ Retriever
→ Ranker
→ Prompt Builder
→ LLM
→ Answer with Sources

Two Main Paths

Offline Path

Documents
→ Clean
→ Chunk
→ Embed
→ Store

Online Path

User Query
→ Embed Query
→ Retrieve Chunks
→ Rank Results
→ Build Prompt
→ Generate Answer

👉 Interview Answer

RAG usually has two paths.

The offline ingestion path processes documents, chunks them, creates embeddings, and stores them in a search index or vector database.

The online query path retrieves relevant chunks, ranks them, builds the prompt, and sends the grounded context to the LLM.


5️⃣ Ingestion Pipeline


What Is Ingestion?

Ingestion is the process of preparing knowledge for retrieval.


Ingestion Steps

Raw Documents
→ Parse
→ Clean
→ Normalize
→ Chunk
→ Embed
→ Store

Input Sources


Why Ingestion Matters

Bad ingestion leads to bad retrieval.


👉 Interview Answer

The ingestion pipeline prepares documents for RAG.

It parses raw data, cleans text, splits documents into chunks, generates embeddings, and stores them in a retrievable index.

Ingestion quality directly affects retrieval quality.
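
The Parse, Clean, and Normalize steps can be sketched as a single text-cleaning function. This is a minimal illustration using only the Python standard library; the function name `clean_text` and the regex-based tag stripping are assumptions for the example, and a real pipeline would likely use a proper parser for each source format.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Turn raw HTML-like text into normalized plain text, ready for chunking."""
    text = html.unescape(raw)                   # decode entities such as &nbsp; and &amp;
    text = re.sub(r"<[^>]+>", " ", text)        # crude tag stripping, not a real HTML parser
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters (NBSP becomes a space)
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)      # keep at most one blank line between paragraphs
    return text.strip()


if __name__ == "__main__":
    raw = "<p>Refunds&nbsp;are   allowed</p>\n\n\n<p>within 30 days.</p>"
    print(clean_text(raw))
```

Keeping paragraph breaks intact matters here because later chunking strategies, such as section-based chunking, often rely on them.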


6️⃣ Chunking


What Is Chunking?

Chunking means splitting large documents into smaller pieces.

Long document
→ Chunk 1
→ Chunk 2
→ Chunk 3

Why Chunking Is Needed

LLMs and retrievers work better with focused pieces of text.


Chunking Strategies

| Strategy | Use Case |
|---|---|
| Fixed-size chunking | Simple documents |
| Semantic chunking | Structured knowledge |
| Section-based chunking | Markdown / docs |
| Sliding window | Preserve overlap |
| Code-aware chunking | Code repositories |

Chunk Size Trade-off

| Chunk Size | Strength | Weakness |
|---|---|---|
| Small chunks | Precise retrieval | May lose context |
| Large chunks | More context | Less precise and more costly |

👉 Interview Answer

Chunking is important because retrieval happens at the chunk level.

If chunks are too small, the system may lose context.

If chunks are too large, retrieval becomes less precise and prompts become expensive.

Good chunking balances precision and context.
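
To make the size and overlap trade-off concrete, here is a minimal sketch of fixed-size chunking with a sliding-window overlap. It splits on whitespace and counts words rather than model tokens, which is a simplification; the function name `chunk_text` and its default sizes are assumptions for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the document
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

A larger `overlap` reduces the chance that a sentence is cut off from its context at a chunk boundary, at the cost of storing and embedding more duplicated text.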


7️⃣ Embeddings


What Is an Embedding?

An embedding converts text into a vector.

"refund policy"
→ [0.12, -0.45, 0.89, ...]

Texts with similar meaning have similar vectors.


Embedding Flow

Document chunk
→ Embedding model
→ Vector
→ Store in vector database

Query Embedding

User query
→ Embedding model
→ Query vector
→ Search similar document vectors

👉 Interview Answer

Embeddings allow semantic search.

The system converts document chunks and user queries into vectors, then searches for chunks with similar meaning.

This is the foundation of vector-based RAG retrieval.
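
"Similar vectors" is usually measured with cosine similarity. The sketch below ranks two chunks against a query using hand-written 3-dimensional vectors. These vectors are made up for illustration; a real system would obtain them from an embedding model and they would have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
chunk_vectors = {
    "refund policy": [0.8, 0.2, 0.1],
    "office hours": [0.0, 0.1, 0.9],
}
query_vector = [0.9, 0.1, 0.0]  # imagined embedding of "can I get my money back?"

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)

if __name__ == "__main__":
    print(ranked)
```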


8️⃣ Vector Database


What Does Vector DB Store?

A vector database stores:

  • Embeddings
  • Document chunks (the chunk text)
  • Metadata


Example Record

{
  "chunk_id": "chunk_123",
  "document_id": "doc_456",
  "text": "Refunds are allowed within 30 days...",
  "embedding": [0.12, -0.45, 0.89],
  "metadata": {
    "source": "refund_policy.md",
    "updated_at": "2026-05-24",
    "department": "support"
  }
}

Why Metadata Matters

Metadata supports:

  • Filtering
  • Access control
  • Freshness checks
  • Citation generation
  • Debugging


👉 Interview Answer

A vector database stores embeddings, document chunks, and metadata.

Metadata is important because it enables filtering, access control, freshness checks, citation generation, and debugging.
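
As a rough illustration of how the record shape above supports filtered search, here is a tiny in-memory store that does a brute-force cosine search. The class `InMemoryVectorStore` and its `where` parameter are hypothetical names for this sketch and do not correspond to any particular vector database API; production systems use approximate nearest-neighbor indexes instead of scanning every record.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    chunk_id: str
    document_id: str
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force store: fine for a demo, far too slow for large collections."""

    def __init__(self) -> None:
        self._records: list[ChunkRecord] = []

    def add(self, record: ChunkRecord) -> None:
        self._records.append(record)

    def search(self, query_embedding, top_k=3, where=None):
        """Return (score, record) pairs, keeping only records whose metadata matches `where`."""
        where = where or {}
        candidates = [
            r for r in self._records
            if all(r.metadata.get(key) == value for key, value in where.items())
        ]
        scored = [(_cosine(query_embedding, r.embedding), r) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add(ChunkRecord("chunk_123", "doc_456", "Refunds are allowed within 30 days...",
                          [0.12, -0.45, 0.89], {"source": "refund_policy.md", "department": "support"}))
    for score, record in store.search([0.12, -0.45, 0.89], where={"department": "support"}):
        print(round(score, 3), record.chunk_id, record.metadata["source"])
```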


9️⃣ Retrieval


Retrieval Flow

User Query
→ Query Embedding
→ Vector Search
→ Candidate Chunks
→ Ranking
→ Selected Context

Retrieval Types

| Type | Description |
|---|---|
| Vector search | Semantic similarity |
| Keyword search | Exact term matching |
| Hybrid search | Vector + keyword |
| Metadata filtering | Filter by source, date, permission |
| Graph retrieval | Relationship-aware retrieval |

Why Hybrid Search Is Common

Vector search is good for meaning.

Keyword search is good for exact terms.

Hybrid search combines both.


👉 Interview Answer

Retrieval is the process of finding relevant chunks for the user query.

Many production RAG systems use hybrid retrieval, combining vector search for semantic similarity with keyword search for exact matches.

Metadata filtering is also important for permissions and freshness.
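
One common way to combine a vector ranking with a keyword ranking is Reciprocal Rank Fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible choice. The sketch assumes each retriever has already returned an ordered list of chunk IDs; the constant `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it appears in each list.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical output of vector search
    keyword_hits = ["b", "d", "a"]  # hypothetical output of keyword search
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and keyword scores live on different scales.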


🔟 Ranking and Re-ranking


Why Ranking Is Needed

Initial retrieval may return noisy results.

The system must decide which chunks are most useful.


Ranking Signals


Re-ranking Flow

Retrieve top 50 chunks
→ Re-ranker scores chunks
→ Select top 5 to 10 chunks
→ Add to prompt

👉 Interview Answer

Retrieval alone is often not enough.

Production RAG systems usually rank or re-rank candidate chunks before adding them to the prompt.

This improves relevance, reduces noise, and controls context size.
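
The retrieve-then-re-rank flow can be sketched as follows. The scorer here is a simple word-overlap ratio used as a stand-in; production systems typically use a stronger model such as a cross-encoder. The names `overlap_score` and `rerank` are assumptions for this example.

```python
from typing import Callable


def overlap_score(query: str, text: str) -> float:
    """Fraction of query words that also appear in the chunk (toy relevance score)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(
    query: str,
    candidates: list[str],
    top_k: int = 5,
    scorer: Callable[[str, str], float] = overlap_score,
) -> list[str]:
    """Score every candidate chunk against the query and keep the best `top_k`."""
    ordered = sorted(candidates, key=lambda chunk: scorer(query, chunk), reverse=True)
    return ordered[:top_k]


if __name__ == "__main__":
    candidates = [
        "Office hours are 9 to 5",
        "Refund requests are accepted within 30 days",
        "The refund window is 30 days",
    ]
    print(rerank("refund window days", candidates, top_k=2))
```

Because `scorer` is a parameter, the toy function can be swapped for a model-based scorer without changing the surrounding flow.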


1️⃣1️⃣ Prompt Building


What Prompt Builder Does

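The prompt builder combines:

  • The user question
  • Relevant retrieved chunks
  • Citation metadata
  • The output format
  • Instructions for handling insufficient context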


Prompt Structure

System instruction
User question
Retrieved context
Rules:
- Answer only from provided context
- Cite sources
- Say when context is insufficient
Output format

Important Rule

The model should know what context it can trust.


👉 Interview Answer

The prompt builder decides how retrieved knowledge is presented to the LLM.

It should include the user question, relevant chunks, citation metadata, output format, and instructions for handling insufficient context.

Good prompt construction improves factuality and consistency.
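
A prompt builder along these lines might look like the sketch below. It numbers each chunk so the model can cite it, and it stops adding chunks once a context budget is reached. The budget is counted in characters for simplicity (a real system would count tokens), and the function name, the `source` and `text` keys, and the instruction wording are all assumptions.

```python
def build_prompt(question: str, chunks: list[dict], max_context_chars: int = 2000) -> str:
    """Assemble instructions, numbered sources, and the question into one prompt string."""
    context_lines = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] (source: {chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_context_chars:
            break  # chunks are assumed to be sorted best-first, so drop the rest
        context_lines.append(entry)
        used += len(entry)

    context = "\n".join(context_lines) or "(no context retrieved)"
    return (
        "You are a support assistant.\n\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        "Rules:\n"
        "- Answer only from the provided context.\n"
        "- Cite sources by their bracketed number, for example [1].\n"
        "- If the context is insufficient, say so instead of guessing.\n\n"
        "Answer:"
    )


if __name__ == "__main__":
    chunks = [{"source": "refund_policy.md", "text": "Refunds are allowed within 30 days."}]
    print(build_prompt("What is the refund window?", chunks))
```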


1️⃣2️⃣ Generation


Generation Step

The LLM receives retrieved context and generates the answer.

Retrieved Context
+ User Question
+ Instructions
→ LLM Answer

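Good RAG Answer Should

  • Use the retrieved context
  • Cite sources
  • Avoid unsupported claims
  • Say when the context is insufficient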


Failure Example

Retrieved context does not answer question
→ LLM guesses anyway

👉 Interview Answer

In the generation step, the LLM should answer using retrieved context.

If the retrieved context is insufficient, the system should instruct the model to say so instead of guessing.

This is critical for reducing hallucination.
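
Besides instructing the model, the system can add a guard in code: if no retrieved chunk clears a relevance threshold, skip the LLM call and return a fixed "insufficient context" message. This is a sketch under assumptions: `generate` is a hypothetical callable standing in for whatever LLM client is used, and the `min_score` threshold of 0.5 is arbitrary.

```python
from typing import Callable

INSUFFICIENT_CONTEXT = "I don't have enough information in the provided sources to answer that."


def answer_with_guard(
    question: str,
    scored_chunks: list[tuple[float, str]],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.5,
) -> str:
    """Call the LLM only when at least one chunk is relevant enough to ground the answer."""
    usable = [text for score, text in scored_chunks if score >= min_score]
    if not usable:
        return INSUFFICIENT_CONTEXT  # refuse instead of letting the model guess
    return generate(question, usable)


if __name__ == "__main__":
    def fake_llm(question: str, context: list[str]) -> str:
        # Placeholder for a real model call.
        return f"Answer based on {len(context)} chunk(s)."

    print(answer_with_guard("What is the refund window?", [(0.2, "Office hours are 9 to 5")], fake_llm))
    print(answer_with_guard("What is the refund window?", [(0.9, "Refunds are allowed within 30 days.")], fake_llm))
```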


1️⃣3️⃣ Access Control


Why Access Control Matters

Enterprise RAG may contain sensitive documents.

Users should only retrieve documents they are allowed to see.


Access Control Points


Example

User from Team A
→ Retrieve only documents allowed for Team A

👉 Interview Answer

Access control is critical in enterprise RAG.

The retrieval system must filter documents based on user permissions before adding context to the prompt.

The LLM should never receive unauthorized content.
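
A minimal sketch of permission filtering, applied before ranking and prompt building, is shown below. It assumes each chunk carries an `allowed_groups` list in its metadata; that field name is an assumption for illustration. Chunks with no listed groups are denied by default.

```python
def filter_by_permissions(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks that share at least one group with the user (default deny)."""
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]


if __name__ == "__main__":
    chunks = [
        {"chunk_id": "c1", "allowed_groups": ["team_a"]},
        {"chunk_id": "c2", "allowed_groups": ["team_b"]},
        {"chunk_id": "c3"},  # no ACL recorded, so nobody can retrieve it
    ]
    visible = filter_by_permissions(chunks, {"team_a"})
    print([chunk["chunk_id"] for chunk in visible])
```

In practice it is usually better to push this filter into the search query itself, so that unauthorized chunks never leave the index, rather than filtering results afterward.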


1️⃣4️⃣ Evaluation


What to Evaluate

RAG systems need evaluation at multiple layers.


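Retrieval Metrics

  • Recall
  • Precision
  • Relevance
  • Permission correctness

Generation Metrics

  • Answer faithfulness
  • Citation quality

Production Signals

  • Latency
  • Cost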


👉 Interview Answer

RAG evaluation should measure both retrieval quality and generation quality.

Good retrieval means the right documents are found.

Good generation means the answer is faithful to those documents, cites sources, and avoids unsupported claims.
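
On the retrieval side, three widely used metrics are recall@k, precision@k, and reciprocal rank. The sketch below computes them for a single query, assuming a labeled set of relevant chunk IDs is available; the function names are chosen for this example. Precision is divided by `k` here, which is one common convention.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k slots that are filled with relevant chunks."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    retrieved = ["c1", "c7", "c3", "c9"]
    relevant = {"c3", "c5"}
    print(recall_at_k(retrieved, relevant, k=3))
    print(precision_at_k(retrieved, relevant, k=3))
    print(reciprocal_rank(retrieved, relevant))
```

Averaging these values over a set of test queries gives a baseline that can be tracked as chunking, embedding, or ranking choices change.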


1️⃣5️⃣ Common Failure Modes


Failure Modes

RAG systems can fail because of:

  • Bad chunking
  • Weak metadata
  • Stale documents
  • Poor retrieval


Example

Wrong chunk retrieved
→ LLM answers confidently
→ User receives incorrect answer

👉 Interview Answer

RAG failures can happen at any layer: ingestion, chunking, embedding, retrieval, ranking, prompt building, generation, or access control.

Debugging RAG requires tracing the full pipeline.
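
One lightweight way to trace the pipeline is to record, for every request, what each stage produced and how long it took. The sketch below is an illustration only; the class name `PipelineTrace` and the fields it records are assumptions, and a production system would more likely send this data to an existing tracing or logging stack.

```python
import time
from contextlib import contextmanager


class PipelineTrace:
    """Collects per-stage details and timings for a single RAG request."""

    def __init__(self, query: str) -> None:
        self.query = query
        self.stages: list[dict] = []

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        record = {"stage": name, "detail": {}}
        try:
            yield record["detail"]  # the stage fills in whatever is useful for debugging
        finally:
            record["ms"] = (time.perf_counter() - start) * 1000
            self.stages.append(record)


if __name__ == "__main__":
    trace = PipelineTrace("What is the refund window?")
    with trace.stage("retrieval") as detail:
        detail["chunk_ids"] = ["chunk_123", "chunk_200"]
    with trace.stage("rerank") as detail:
        detail["kept"] = ["chunk_123"]
    for record in trace.stages:
        print(record["stage"], record["detail"])
```

With a record like this, a wrong answer can be traced back to the stage where the right chunk was dropped or the wrong one was introduced.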


1️⃣6️⃣ Best Practices


Practical Rules


Design Principle

RAG quality depends more on retrieval and context engineering
than on the LLM alone.

👉 Interview Answer

A good RAG system is not just a vector database plus an LLM.

It requires high-quality ingestion, chunking, metadata, retrieval, ranking, prompt building, access control, evaluation, and observability.


🧠 Final Staff-Level Answer


👉 Interview Answer (Full Version)

RAG, or Retrieval-Augmented Generation, is an architecture where the system retrieves relevant external knowledge at runtime and provides it to the LLM as context.

This is useful because LLMs do not automatically know private, recent, or domain-specific information.

A production RAG system usually has two paths: an offline ingestion path and an online query path.

The offline path parses documents, cleans text, chunks documents, generates embeddings, and stores chunks with metadata in a vector database or search index.

The online path receives the user query, optionally rewrites it, embeds it, retrieves candidate chunks, ranks or re-ranks them, builds the prompt, and sends the grounded context to the LLM.

The quality of RAG depends heavily on ingestion and retrieval.

Bad chunking, weak metadata, stale documents, or poor retrieval can cause the LLM to produce incorrect answers even if the model is strong.

In production, I would usually use hybrid retrieval, combining vector search for semantic similarity with keyword search for exact matches.

I would also store metadata such as document ID, source, timestamp, owner, access control tags, and freshness signals.

Before sending context to the model, the system should filter by permissions, rank results, remove irrelevant chunks, and keep context within token limits.

The prompt should instruct the model to answer only from retrieved context, cite sources, and say when the context is insufficient.

RAG systems also need evaluation and observability.

I would measure retrieval recall, precision, relevance, permission correctness, answer faithfulness, citation quality, latency, and cost.

The key point is that RAG is not just “vector database plus LLM.”

It is a full knowledge system with ingestion, indexing, retrieval, ranking, prompt building, generation, access control, evaluation, and monitoring.


⭐ Final Insight

The core of RAG is not “stuffing documents into the LLM.”

A real RAG architecture is:

Ingestion
+ Chunking
+ Embeddings
+ Vector / Hybrid Search
+ Ranking
+ Prompt Building
+ Generation
+ Citations
+ Evaluation
+ Access Control

In a good RAG system, quality depends mainly on retrieval and context engineering, not on the LLM alone.


