
🎯 RAG Architecture Explained for Engineers

1️⃣ Core Framework

When discussing RAG Architecture, I frame it as:

Why RAG is needed
Knowledge ingestion
Chunking and embeddings
Vector storage and indexing
Query rewriting and retrieval
Ranking and context building
Generation with citations
Trade-offs: accuracy vs latency vs cost

2️⃣ What Is RAG?

RAG stands for Retrieval-Augmented Generation.

It combines:

External Knowledge Retrieval
+ LLM Generation

Instead of asking the LLM to answer only from model memory, the system retrieves relevant knowledge at runtime and gives it to the model as context.

Basic Flow

User Question
→ Retrieve relevant documents
→ Add documents to prompt
→ LLM generates grounded answer

👉 Interview Answer

RAG is an architecture where the system retrieves relevant external knowledge at runtime and provides that knowledge to the LLM as context.

This helps the model answer questions using private, updated, or domain-specific information instead of relying only on model memory.

3️⃣ Why Do We Need RAG?

LLM Limitations

LLMs may not know:

Private company data
Recent information
Internal documentation
Customer-specific data
Domain-specific policies
Large knowledge bases

Without RAG

User asks about internal policy
→ LLM guesses
→ Risk of hallucination

With RAG

User asks about internal policy
→ System retrieves policy document
→ LLM answers using retrieved context

👉 Interview Answer

RAG is useful because LLMs do not automatically know private or updated information.

By retrieving relevant documents at runtime, the system can ground the answer in real sources, reduce hallucination, and support enterprise knowledge use cases.

4️⃣ High-Level RAG Architecture

Architecture

Documents
→ Ingestion Pipeline
→ Chunking
→ Embedding Model
→ Vector Database
→ Retriever
→ Ranker
→ Prompt Builder
→ LLM
→ Answer with Sources

Two Main Paths

Offline Path

Documents
→ Clean
→ Chunk
→ Embed
→ Store

Online Path

User Query
→ Embed Query
→ Retrieve Chunks
→ Rank Results
→ Build Prompt
→ Generate Answer

👉 Interview Answer

RAG usually has two paths.

The offline ingestion path processes documents, chunks them, creates embeddings, and stores them in a search index or vector database.

The online query path retrieves relevant chunks, ranks them, builds the prompt, and sends the grounded context to the LLM.
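
To make the two paths concrete, here is a minimal sketch. It makes several simplifying assumptions that are not part of the notes above: each document is a single chunk, plain word overlap stands in for embedding similarity, and the function names `build_index` and `build_grounded_prompt` are hypothetical.

```python
def tokenize(text: str) -> set:
    """Lowercase and split on whitespace (a stand-in for real embedding)."""
    return set(text.lower().split())


def build_index(documents: dict) -> list:
    """Offline path: turn each document into one indexed chunk."""
    return [
        {"source": name, "text": text, "tokens": tokenize(text)}
        for name, text in documents.items()
    ]


def build_grounded_prompt(query: str, index: list, k: int = 1) -> str:
    """Online path: retrieve by word overlap, rank, and build the prompt."""
    query_tokens = tokenize(query)
    ranked = sorted(
        index, key=lambda c: len(query_tokens & c["tokens"]), reverse=True
    )[:k]
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in ranked)
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = {
    "refund_policy.md": "Refunds are allowed within 30 days",
    "parking.md": "Parking passes renew every year",
}
index = build_index(docs)                      # run once, offline
prompt = build_grounded_prompt("are refunds allowed", index)  # run per query
print(prompt)
```

The point of the split is that `build_index` can run on a schedule, while `build_grounded_prompt` runs on every request and should stay cheap.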

5️⃣ Ingestion Pipeline

What Is Ingestion?

Ingestion is the process of preparing knowledge for retrieval.

Ingestion Steps

Raw Documents
→ Parse
→ Clean
→ Normalize
→ Chunk
→ Embed
→ Store

Input Sources

PDFs
Markdown files
Web pages
Internal docs
Tickets
Incident reports
Code repositories
Database records

Why Ingestion Matters

Bad ingestion leads to bad retrieval.

👉 Interview Answer

The ingestion pipeline prepares documents for RAG.

It parses raw data, cleans text, splits documents into chunks, generates embeddings, and stores them in a retrievable index.

Ingestion quality directly affects retrieval quality.
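
As a rough illustration of the parse, clean, and normalize steps, here is a small sketch using only the Python standard library. The function name `clean_text` and the specific rules (dropping HTML tags, decoding entities, NFKC normalization, collapsing whitespace) are assumptions for illustration; real pipelines usually need format-specific parsers for PDFs, tickets, or code.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Turn a raw HTML-ish string into plain, normalized text."""
    text = re.sub(r"<[^>]+>", " ", raw)          # drop markup tags
    text = html.unescape(text)                   # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)   # unify Unicode variants
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


print(clean_text("<p>Refunds&nbsp;are allowed   within 30 days.</p>"))
# prints: Refunds are allowed within 30 days.
```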

6️⃣ Chunking

What Is Chunking?

Chunking means splitting large documents into smaller pieces.

Long document
→ Chunk 1
→ Chunk 2
→ Chunk 3

Why Chunking Is Needed

LLMs and retrievers work better with focused pieces of text.

Chunking Strategies

Strategy	Use Case
Fixed-size chunking	Simple documents
Semantic chunking	Structured knowledge
Section-based chunking	Markdown / docs
Sliding window	Preserve context across chunk boundaries via overlap
Code-aware chunking	Code repositories

Chunk Size Trade-off

Chunk Size	Strength	Weakness
Small chunks	Precise retrieval	May lose context
Large chunks	More context	Less precise and more costly

👉 Interview Answer

Chunking is important because retrieval happens at the chunk level.

If chunks are too small, the system may lose context.

If chunks are too large, retrieval becomes less precise and prompts become expensive.

Good chunking balances precision and context.
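
The precision-versus-context trade-off can be explored with a simple sliding-window chunker. This is a sketch under assumptions: it counts whitespace-separated words rather than model tokens, and the name `chunk_words` and its default sizes are illustrative, not recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split text into chunks of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g", size=4, overlap=2))
# prints: ['a b c d', 'c d e f', 'e f g']
```

Raising `size` gives each chunk more context. Raising `overlap` reduces the chance that a sentence is cut off from the text it depends on, at the cost of storing more chunks.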

7️⃣ Embeddings

What Is an Embedding?

An embedding converts text into a vector.

"refund policy"
→ [0.12, -0.45, 0.89, ...]

Texts with similar meaning map to vectors that are close together, typically measured by cosine similarity or dot product.

Embedding Flow

Document chunk
→ Embedding model
→ Vector
→ Store in vector database

Query Embedding

User query
→ Embedding model
→ Query vector
→ Search similar document vectors

👉 Interview Answer

Embeddings allow semantic search.

The system converts document chunks and user queries into vectors, then searches for chunks with similar meaning.

This is the foundation of vector-based RAG retrieval.
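
"Searching for chunks with similar meaning" usually comes down to comparing vectors. The sketch below uses cosine similarity on hand-picked 3-dimensional vectors; the vectors and chunk names are made up for illustration, and a real system would get them from an embedding model with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, chosen by hand for this example
query_vector = [0.9, 0.1, 0.0]
chunk_vectors = {
    "refund policy": [0.8, 0.2, 0.1],
    "office parking": [0.0, 0.1, 0.9],
}
best = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
)
print(best)  # prints: refund policy
```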

8️⃣ Vector Database

What Does a Vector DB Store?

A vector database stores:

Chunk text
Embedding vector
Document ID
Metadata
Source URL
Timestamp
Access control tags

Example Record

{
  "chunk_id": "chunk_123",
  "document_id": "doc_456",
  "text": "Refunds are allowed within 30 days...",
  "embedding": [0.12, -0.45, 0.89],
  "metadata": {
    "source": "refund_policy.md",
    "updated_at": "2026-05-24",
    "department": "support"
  }
}

Why Metadata Matters

Metadata supports:

Filtering
Access control
Freshness ranking
Citation generation
Debugging

👉 Interview Answer

A vector database stores embeddings, document chunks, and metadata.

Metadata is important because it enables filtering, access control, freshness checks, citation generation, and debugging.
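
To show how records and metadata work together, here is a toy in-memory store. It is a sketch, not a real vector database: it scans every record (no approximate-nearest-neighbor index), and the names `Record`, `InMemoryVectorStore`, and the `where` filter argument are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    chunk_id: str
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)


def _cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self) -> None:
        self.records = []

    def add(self, record: Record) -> None:
        self.records.append(record)

    def search(self, query: list, k: int = 3, where: Optional[dict] = None) -> list:
        """Return the k most similar records whose metadata matches `where`."""
        where = where or {}
        candidates = [
            r for r in self.records
            if all(r.metadata.get(key) == value for key, value in where.items())
        ]
        candidates.sort(key=lambda r: _cosine(query, r.embedding), reverse=True)
        return candidates[:k]


store = InMemoryVectorStore()
store.add(Record("chunk_123", "Refunds are allowed within 30 days...",
                 [0.12, -0.45, 0.89], {"department": "support"}))
store.add(Record("chunk_999", "Parking passes renew yearly.",
                 [0.10, -0.40, 0.90], {"department": "facilities"}))
hits = store.search([0.12, -0.45, 0.89], k=1, where={"department": "support"})
print(hits[0].chunk_id)  # prints: chunk_123
```

Filtering on metadata before scoring is what lets the same index serve different departments, freshness windows, or permission scopes.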

9️⃣ Retrieval

Retrieval Flow

User Query
→ Query Embedding
→ Vector Search
→ Candidate Chunks
→ Ranking
→ Selected Context

Retrieval Types

Type	Description
Vector search	Semantic similarity
Keyword search	Exact term matching
Hybrid search	Vector + keyword
Metadata filtering	Filter by source, date, permission
Graph retrieval	Relationship-aware retrieval

Why Hybrid Search Is Common

Vector search is good for meaning.

Keyword search is good for exact terms.

Hybrid search combines both.

👉 Interview Answer

Retrieval is the process of finding relevant chunks for the user query.

Many production RAG systems use hybrid retrieval, combining vector search for semantic similarity with keyword search for exact matches.

Metadata filtering is also important for permissions and freshness.
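
The notes above do not say how vector and keyword results are merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method any particular system uses. The constant `k = 60` is a conventional default, and the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_hits = ["c1", "c2", "c3"]    # hypothetical semantic-search order
keyword_hits = ["c3", "c1", "c4"]   # hypothetical keyword-search order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # prints: ['c1', 'c3', 'c2', 'c4']
```

RRF only needs rank positions, so it avoids having to put vector scores and keyword scores on a common scale.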

🔟 Ranking and Re-ranking

Why Ranking Is Needed

Initial retrieval may return noisy results.

The system must decide which chunks are most useful.

Ranking Signals

Semantic relevance
Keyword match
Freshness
Source authority
User permission
Document type
Historical usefulness

Re-ranking Flow

Retrieve top 50 chunks
→ Re-ranker scores chunks
→ Select top 5 to 10 chunks
→ Add to prompt

👉 Interview Answer

Retrieval alone is often not enough.

Production RAG systems usually rank or re-rank candidate chunks before adding them to the prompt.

This improves relevance, reduces noise, and controls context size.
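
A re-ranking step that blends several of the signals listed above might look like the following sketch. The weighting scheme, the exponential freshness decay, and the names `Candidate` and `rerank` are assumptions for illustration; the `relevance` field stands in for the output of a re-ranker model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    chunk_id: str
    relevance: float   # assumed re-ranker score in [0, 1]
    updated_at: date


def rerank(candidates, today, top_k=5, freshness_weight=0.2, half_life_days=365):
    """Blend relevance with a freshness score that halves every half_life_days."""
    def score(c):
        age_days = max((today - c.updated_at).days, 0)
        freshness = 0.5 ** (age_days / half_life_days)
        return (1 - freshness_weight) * c.relevance + freshness_weight * freshness

    return sorted(candidates, key=score, reverse=True)[:top_k]


today = date(2026, 6, 1)
candidates = [
    Candidate("old_policy", 0.80, date(2022, 6, 1)),
    Candidate("new_policy", 0.78, date(2026, 5, 24)),
    Candidate("unrelated", 0.10, date(2026, 5, 30)),
]
top = rerank(candidates, today, top_k=2)
print([c.chunk_id for c in top])  # prints: ['new_policy', 'old_policy']
```

Here the slightly less relevant but much fresher chunk wins, and the unrelated chunk is dropped, which keeps the prompt small.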

1️⃣1️⃣ Prompt Building

What the Prompt Builder Does

The prompt builder combines:

System instruction
User question
Retrieved context
Citation metadata
Output format
Safety constraints

Prompt Structure

System instruction
User question
Retrieved context
Rules:
- Answer only from provided context
- Cite sources
- Say when context is insufficient
Output format

Important Rule

The model should know which context it can trust, so retrieved chunks should be clearly delimited from the instructions and the user question.

👉 Interview Answer

The prompt builder decides how retrieved knowledge is presented to the LLM.

It should include the user question, relevant chunks, citation metadata, output format, and instructions for handling insufficient context.

Good prompt construction improves factuality and consistency.
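
The prompt structure above can be sketched as a small function. The exact wording of the instructions, the `[n]` citation style, and the `source` and `text` keys on each chunk are assumptions for illustration.

```python
def build_prompt(question: str, chunks: list) -> str:
    """Assemble rules, numbered sources, and the user question into one prompt."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a support assistant.\n"
        "Rules:\n"
        "- Answer only from the provided context.\n"
        "- Cite sources as [n].\n"
        "- If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the refund window?",
    [{"source": "refund_policy.md", "text": "Refunds are allowed within 30 days..."}],
)
print(prompt)
```

Numbering each chunk and attaching its source gives the model something concrete to cite, and gives the application a way to map `[1]` back to a document when it displays the answer.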

1️⃣2️⃣ Generation

Generation Step

The LLM receives retrieved context and generates the answer.

Retrieved Context
+ User Question
+ Instructions
→ LLM Answer

A Good RAG Answer Should

Use retrieved context
Avoid unsupported claims
Cite sources
Mention uncertainty
Say when context is insufficient
Avoid hallucination

Failure Example

Retrieved context does not answer question
→ LLM guesses anyway

👉 Interview Answer

In the generation step, the LLM should answer using retrieved context.

If the retrieved context is insufficient, the system should instruct the model to say so instead of guessing.

This is critical for reducing hallucination.
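
Besides instructing the model, a system can also refuse before calling the model at all when nothing relevant was retrieved. The sketch below shows that complementary guard. The `score` key, the `min_score` threshold, and the `llm` callable are hypothetical, and the stub at the end stands in for a real model call.

```python
from typing import Callable

REFUSAL = "I could not find enough information in the provided sources to answer."


def generate_answer(question: str, chunks: list, llm: Callable[[str], str],
                    min_score: float = 0.5) -> str:
    """Call the LLM only when at least one chunk clears the relevance bar."""
    usable = [c for c in chunks if c["score"] >= min_score]
    if not usable:
        return REFUSAL  # refuse up front instead of letting the model guess
    context = "\n".join(c["text"] for c in usable)
    prompt = (
        "Answer only from this context. If it is insufficient, say so.\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stub used for demonstration; a real system would call a model here."""
    return "stub answer"


print(generate_answer("refund window?", [{"text": "unrelated", "score": 0.1}], fake_llm))
```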

1️⃣3️⃣ Access Control

Why Access Control Matters

Enterprise RAG may contain sensitive documents.

Users should only retrieve documents they are allowed to see.

Access Control Points

During ingestion
During retrieval filtering
During prompt construction
During citation display
During logging

Example

User from Team A
→ Retrieve only documents allowed for Team A

👉 Interview Answer

Access control is critical in enterprise RAG.

The retrieval system must filter documents based on user permissions before adding context to the prompt.

The LLM should never receive unauthorized content.
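
A permission filter applied before prompt building could be as simple as the sketch below. The `allowed_groups` field and the group names are hypothetical. Chunks without an access list are dropped, which is a deny-by-default choice made for this example.

```python
def filter_by_permission(chunks: list, user_groups: set) -> list:
    """Keep only chunks whose access list shares a group with the user."""
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]


chunks = [
    {"id": "a", "allowed_groups": ["team_a"]},
    {"id": "b", "allowed_groups": ["team_b"]},
    {"id": "c"},  # no access list: denied by default
]
visible = filter_by_permission(chunks, {"team_a"})
print([chunk["id"] for chunk in visible])  # prints: ['a']
```

In practice the same rule is often pushed down into the retrieval query as a metadata filter, so unauthorized chunks never leave the index.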

1️⃣4️⃣ Evaluation

What to Evaluate

RAG systems need evaluation at multiple layers.

Retrieval Metrics

Recall
Precision
Relevance
Source freshness
Permission correctness

Generation Metrics

Factuality
Faithfulness
Citation quality
Answer completeness
Refusal when context is insufficient

Production Signals

User feedback
Click-through on citations
Escalation rate
Latency
Cost
Error rate

👉 Interview Answer

RAG evaluation should measure both retrieval quality and generation quality.

Good retrieval means the right documents are found.

Good generation means the answer is faithful to those documents, cites sources, and avoids unsupported claims.
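
Retrieval precision and recall are straightforward to compute once a labeled set of relevant chunks exists for each test query. The sketch below assumes such labels are available; the chunk IDs are made up, and precision is divided by the number of results actually inspected.

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["c1", "c7", "c3", "c9"]   # hypothetical retriever output
relevant = {"c1", "c3", "c5"}          # hypothetical human labels
print(precision_at_k(retrieved, relevant, 4))  # prints: 0.5
print(recall_at_k(retrieved, relevant, 4))
```

Generation metrics such as faithfulness usually need a separate judge, human or model-based, and are not covered by this sketch.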

1️⃣5️⃣ Common Failure Modes

Failure Modes

RAG systems can fail because of:

Bad ingestion
Poor chunking
Weak embeddings
Wrong retrieval
Missing metadata
Stale documents
Permission leaks
Too much context
LLM ignores context
No evaluation loop

Example

Wrong chunk retrieved
→ LLM answers confidently
→ User receives incorrect answer

👉 Interview Answer

RAG failures can happen at any layer: ingestion, chunking, embedding, retrieval, ranking, prompt building, generation, or access control.

Debugging RAG requires tracing the full pipeline.
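
Tracing the full pipeline is easier when each request leaves behind one structured record. The sketch below shows one possible shape for such a record; the `RagTrace` class and its field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RagTrace:
    query: str
    retrieved_ids: list = field(default_factory=list)
    reranked_ids: list = field(default_factory=list)
    prompt_version: str = ""
    answer: str = ""

    def to_json(self) -> str:
        """Serialize the trace as one JSON line for logging."""
        return json.dumps(asdict(self), sort_keys=True)


trace = RagTrace(query="What is the refund window?")
trace.retrieved_ids = ["chunk_123", "chunk_999"]
trace.reranked_ids = ["chunk_123"]
trace.prompt_version = "v3"
trace.answer = "Refunds are allowed within 30 days [1]."
print(trace.to_json())
```

With records like this, a wrong answer can be traced back to the stage that introduced it: the right chunk was never retrieved, it was retrieved but dropped by the re-ranker, or it reached the prompt and the model ignored it.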

1️⃣6️⃣ Best Practices

Practical Rules

Clean documents before indexing
Choose a chunking strategy that fits the document type
Store metadata with chunks
Use hybrid retrieval when needed
Re-rank candidate chunks
Enforce access control before prompt building
Cite sources
Evaluate retrieval and generation separately
Log retrieved chunks and prompt versions
Refresh stale documents

Design Principle

RAG quality depends more on retrieval and context engineering
than on the LLM alone.

👉 Interview Answer

A good RAG system is not just a vector database plus an LLM.

It requires high-quality ingestion, chunking, metadata, retrieval, ranking, prompt building, access control, evaluation, and observability.

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

RAG, or Retrieval-Augmented Generation, is an architecture where the system retrieves relevant external knowledge at runtime and provides it to the LLM as context.

This is useful because LLMs do not automatically know private, recent, or domain-specific information.

A production RAG system usually has two paths: an offline ingestion path and an online query path.

The offline path parses documents, cleans text, chunks documents, generates embeddings, and stores chunks with metadata in a vector database or search index.

The online path receives the user query, optionally rewrites it, embeds it, retrieves candidate chunks, ranks or re-ranks them, builds the prompt, and sends the grounded context to the LLM.

The quality of RAG depends heavily on ingestion and retrieval.

Bad chunking, weak metadata, stale documents, or poor retrieval can cause the LLM to produce incorrect answers even if the model is strong.

In production, I would usually use hybrid retrieval, combining vector search for semantic similarity with keyword search for exact matches.

I would also store metadata such as document ID, source, timestamp, owner, access control tags, and freshness signals.

Before sending context to the model, the system should filter by permissions, rank results, remove irrelevant chunks, and keep context within token limits.

The prompt should instruct the model to answer only from retrieved context, cite sources, and say when the context is insufficient.

RAG systems also need evaluation and observability.

I would measure retrieval recall, precision, relevance, permission correctness, answer faithfulness, citation quality, latency, and cost.

The key point is that RAG is not just “vector database plus LLM.”

It is a full knowledge system with ingestion, indexing, retrieval, ranking, prompt building, generation, access control, evaluation, and monitoring.

⭐ Final Insight

The core of RAG is not "stuffing documents into the LLM."

A real RAG architecture is:

Ingestion
Chunking
Embeddings
Vector / Hybrid Search
Ranking
Prompt Building
Generation
Citations
Evaluation
Access Control

In a good RAG system, quality depends mainly on retrieval and context engineering, not on the LLM alone.
