System Design Deep Dive - 01 Intro to LLM Systems

Post by ailswan May. 24, 2026

中文 ↓

🎯 Intro to LLM Systems

1️⃣ Core Framework

When discussing LLM Systems, I frame it as:

  1. Model layer
  2. Prompt and context layer
  3. Retrieval / RAG layer
  4. Tool calling layer
  5. Memory and personalization
  6. Evaluation and monitoring
  7. Safety and guardrails
  8. Trade-offs: quality vs latency vs cost

2️⃣ What Is an LLM System?

An LLM system is not just a model.

It is a full application architecture around a large language model.

User Input
→ Prompt Builder
→ Context Retrieval
→ LLM
→ Tool Calls
→ Response Parser
→ Safety Checks
→ Final Answer

👉 Interview Answer

An LLM system is an application built around a large language model.

The model generates language, but the system around it provides context, retrieves knowledge, calls tools, enforces safety, monitors quality, and controls cost and latency.


3️⃣ Core Components


Model

The LLM is responsible for:


Prompt

Prompt defines:


Context

Context is information given to the model at runtime.

Examples:


👉 Interview Answer

The LLM itself is only one component.

A production LLM system also needs prompt design, context management, retrieval, tool integration, output validation, monitoring, and safety controls.


4️⃣ Prompt Engineering


Prompt Structure

System instruction
→ Developer instruction
→ User input
→ Retrieved context
→ Output format
→ Constraints

Good Prompt Should Define


👉 Interview Answer

Prompt engineering is about controlling model behavior.

A good prompt clearly defines the task, constraints, context, and expected output format.

For production systems, prompts should be versioned, tested, and monitored like code.


5️⃣ Context Window


What Is Context Window?

The context window is the amount of text the model can see at once.

It includes:

system instructions
conversation history
retrieved documents
tool results
user input

Challenge

The model cannot remember unlimited information inside one request.

So we need:


👉 Interview Answer

The context window limits how much information the model can consider at once.

Since we cannot put everything into the prompt, the system must decide what context is most relevant, retrieve it, rank it, and compress or summarize it when needed.


6️⃣ RAG: Retrieval-Augmented Generation


What Is RAG?

RAG means:

Retrieve relevant knowledge
→ Add it to prompt
→ Let LLM generate answer

RAG Flow

User question
→ Convert question to embedding
→ Search vector database
→ Retrieve relevant chunks
→ Add chunks to prompt
→ LLM answers with context

Why RAG?

LLMs may not know:


👉 Interview Answer

RAG helps ground LLM responses in external knowledge.

Instead of relying only on model memory, the system retrieves relevant documents at runtime and provides them as context to the model.

This improves factuality and allows the LLM to answer questions about private or updated data.


7️⃣ Embeddings and Vector Search


Embedding

An embedding converts text into a vector.

"refund policy" → [0.12, -0.44, 0.89, ...]

Similar meanings have similar vectors.

query embedding
→ find nearest document embeddings

Used For


👉 Interview Answer

Embeddings allow semantic search.

The system converts documents and user queries into vectors, then finds the most similar document chunks using vector search.

This is the foundation of many RAG systems.


8️⃣ Tool Calling


Why Tool Calling?

LLMs cannot reliably do everything internally.

Tools can provide:


Tool Calling Flow

User asks question
→ LLM decides tool is needed
→ System calls tool
→ Tool returns result
→ LLM uses result to answer

👉 Interview Answer

Tool calling allows an LLM to interact with external systems.

The model decides what tool to call and with what arguments, while the application executes the tool safely and returns the result back to the model.

This turns the LLM from a text generator into an action-capable system.


9️⃣ Agents


What Is an Agent?

An agent is an LLM system that can:


Agent Loop

Goal
→ Plan
→ Tool call
→ Observe result
→ Decide next step
→ Final answer

When Agents Are Useful


👉 Interview Answer

An agent is an LLM system that can take multiple steps toward a goal.

It can plan, call tools, observe results, and continue until it reaches an answer or completes a task.

The challenge is controlling reliability, cost, latency, and safety.


🔟 Memory


Types of Memory

Short-term Memory

Conversation history.


Long-term Memory

Persistent user or domain information.


Working Memory

Temporary state during task execution.


Memory Challenges


👉 Interview Answer

Memory helps LLM systems personalize and continue work over time.

Short-term memory comes from conversation history.

Long-term memory stores persistent facts or preferences.

But memory must be managed carefully because it creates privacy, accuracy, and stale-information risks.


1️⃣1️⃣ Output Validation


Why Needed?

LLMs can produce:


Validation Techniques


👉 Interview Answer

Production LLM systems should not blindly trust model output.

The system should validate structure, check business rules, verify tool results, and retry or fallback when output is invalid.


1️⃣2️⃣ Evaluation


Offline Evaluation

Use test datasets.

Metrics:


Online Evaluation

Use production signals.

Metrics:


👉 Interview Answer

LLM systems need continuous evaluation.

Offline evaluation tests prompts and models before launch.

Online evaluation monitors real user outcomes, cost, latency, safety, and task success after launch.


1️⃣3️⃣ Safety and Guardrails


Safety Risks


Guardrails


👉 Interview Answer

Safety is a core part of LLM system design.

The system should protect against hallucination, prompt injection, data leakage, unsafe tool calls, and unauthorized actions.

Guardrails should be applied around both model input and model output.


1️⃣4️⃣ Latency and Cost


Main Cost Drivers


Optimization Strategies


👉 Interview Answer

LLM systems must optimize both latency and cost.

Long prompts, large models, tool calls, and multi-step agents increase cost and latency.

A good system routes simple tasks to cheaper models and reserves expensive reasoning models for harder tasks.


1️⃣5️⃣ Observability


What to Log


Why Important?


👉 Interview Answer

Observability is critical for LLM systems.

I would log prompt version, model version, retrieval results, tool calls, latency, token usage, cost, validation errors, and user feedback.

This makes the system debuggable and improvable.


1️⃣6️⃣ Common Architectures


Simple Chatbot

User input
→ Prompt
→ LLM
→ Response

Good for simple Q&A.


RAG Chatbot

User input
→ Retriever
→ Documents
→ Prompt
→ LLM
→ Answer with sources

Good for enterprise knowledge.


Tool-using Assistant

User input
→ LLM
→ Tool call
→ Tool result
→ LLM
→ Final answer

Good for workflows.


Agentic System

Goal
→ Planner
→ Tools
→ Memory
→ Reflection / validation
→ Final result

Good for multi-step tasks.


1️⃣7️⃣ End-to-End Flow


RAG Flow

User asks question
→ System rewrites query if needed
→ Retrieve relevant documents
→ Rank and filter context
→ Build prompt
→ LLM generates answer
→ Validate answer
→ Return with citations

Tool Flow

User asks task
→ LLM identifies needed tool
→ System validates permission
→ Tool executes
→ Result returned to LLM
→ LLM summarizes result

Agent Flow

User gives goal
→ Agent plans steps
→ Agent calls tools
→ Agent observes results
→ Agent revises plan
→ Agent returns final answer

🧠 Staff-Level Answer Final


👉 Interview Answer Full Version

When designing LLM systems, I do not think of the model alone.

I think of the full application architecture around the model.

The LLM is responsible for language understanding and generation, but the system must provide context, retrieve knowledge, call tools, validate outputs, enforce safety, monitor quality, and control cost and latency.

A basic LLM system takes user input, builds a prompt, calls the model, and returns a response.

More advanced systems use RAG, where the system retrieves relevant documents and adds them to the prompt so the model can answer using external knowledge.

For actions, the system can use tool calling, where the model decides which tool to use and the application executes that tool safely.

For multi-step tasks, we can build agents that plan, call tools, observe results, and iterate.

However, LLM systems need strong guardrails. The system must handle hallucination, prompt injection, data leakage, invalid outputs, unsafe actions, and privacy risks.

Production systems should validate model outputs, log prompt and model versions, evaluate quality offline and online, and monitor latency, cost, safety, and user satisfaction.

The key trade-offs are quality, latency, cost, reliability, and safety.

Ultimately, the goal is to build a system that uses the LLM as a reasoning and language component, while the surrounding architecture provides grounding, control, safety, and reliability.


⭐ Final Insight

LLM System 的核心不是“调用一个模型”, 而是把 model、prompt、context、RAG、tools、memory、evaluation、safety 组合成一个可控、可靠、可评估的应用系统。



中文部分


🎯 Intro to LLM Systems


1️⃣ 核心框架

在讨论 LLM Systems 时,我通常从以下几个方面分析:

  1. Model layer
  2. Prompt and context layer
  3. Retrieval / RAG layer
  4. Tool calling layer
  5. Memory and personalization
  6. Evaluation and monitoring
  7. Safety and guardrails
  8. 核心权衡:quality vs latency vs cost

2️⃣ 什么是 LLM System?

LLM system 不只是一个模型。

它是围绕 large language model 构建的一整套应用架构。

User Input
→ Prompt Builder
→ Context Retrieval
→ LLM
→ Tool Calls
→ Response Parser
→ Safety Checks
→ Final Answer

👉 面试回答

LLM system 是围绕 large language model 构建的应用系统。

Model 负责生成语言, 但 model 周围的系统负责提供 context、 检索知识、调用 tools、执行 safety、 监控质量,并控制 cost 和 latency。


3️⃣ 核心组件


Model

LLM 负责:


Prompt

Prompt 定义:


Context

Context 是 runtime 提供给模型的信息。

例如:


👉 面试回答

LLM 本身只是一个组件。

Production LLM system 还需要 prompt design、 context management、retrieval、tool integration、 output validation、monitoring 和 safety controls。


4️⃣ Prompt Engineering


Prompt Structure

System instruction
→ Developer instruction
→ User input
→ Retrieved context
→ Output format
→ Constraints

Good Prompt Should Define


👉 面试回答

Prompt engineering 是控制 model behavior 的方法。

好的 prompt 会清楚定义 task、constraints、 context 和 expected output format。

对 production systems 来说, prompts 应该像代码一样 versioned、tested 和 monitored。


5️⃣ Context Window


什么是 Context Window?

Context window 是模型一次请求中能看到的文本量。

它包括:

system instructions
conversation history
retrieved documents
tool results
user input

Challenge

模型不能在一个 request 中记住无限信息。

所以需要:


👉 面试回答

Context window 限制了模型一次能考虑多少信息。

因为不能把所有内容都塞进 prompt, 系统必须决定哪些 context 最相关, 检索它们、排序它们, 并在必要时压缩或总结。


6️⃣ RAG: Retrieval-Augmented Generation


什么是 RAG?

RAG 表示:

Retrieve relevant knowledge
→ Add it to prompt
→ Let LLM generate answer

RAG Flow

User question
→ Convert question to embedding
→ Search vector database
→ Retrieve relevant chunks
→ Add chunks to prompt
→ LLM answers with context

为什么需要 RAG?

LLM 可能不知道:


👉 面试回答

RAG 可以让 LLM 的回答基于外部知识。

系统不只依赖 model memory, 而是在 runtime 检索相关 documents, 并把它们作为 context 提供给模型。

这样可以提升 factuality, 也让 LLM 能回答 private data 或 updated data 相关问题。


7️⃣ Embeddings and Vector Search


Embedding

Embedding 将文本转换成向量。

"refund policy" → [0.12, -0.44, 0.89, ...]

Similarity Search

语义相近的文本,向量也相近。

query embedding
→ find nearest document embeddings

Used For


👉 面试回答

Embeddings 让 semantic search 成为可能。

系统将 documents 和 user queries 转换为 vectors, 然后用 vector search 找到最相似的 document chunks。

这是很多 RAG systems 的基础。


8️⃣ Tool Calling


为什么需要 Tool Calling?

LLM 不能可靠地在内部完成所有事情。

Tools 可以提供:


Tool Calling Flow

User asks question
→ LLM decides tool is needed
→ System calls tool
→ Tool returns result
→ LLM uses result to answer

👉 面试回答

Tool calling 让 LLM 可以和外部系统交互。

Model 决定调用哪个 tool 以及传什么 arguments, application 安全地执行 tool, 然后把结果返回给 model。

这样 LLM 就不只是 text generator, 而是可以执行动作的系统。


9️⃣ Agents


什么是 Agent?

Agent 是一种可以执行以下能力的 LLM system:


Agent Loop

Goal
→ Plan
→ Tool call
→ Observe result
→ Decide next step
→ Final answer

When Agents Are Useful


👉 面试回答

Agent 是可以朝着目标执行多步操作的 LLM system。

它可以 plan、call tools、observe results, 并继续执行直到得到答案或完成任务。

难点是控制 reliability、cost、latency 和 safety。


🔟 Memory


Types of Memory

Short-term Memory

Conversation history.


Long-term Memory

Persistent user or domain information.


Working Memory

任务执行过程中的临时状态。


Memory Challenges


👉 面试回答

Memory 帮助 LLM systems 个性化, 并在多次交互中延续任务。

Short-term memory 来自 conversation history。

Long-term memory 存储持久 facts 或 preferences。

但 memory 必须谨慎管理, 因为它会带来 privacy、accuracy 和 stale-information risks。


1️⃣1️⃣ Output Validation


为什么需要?

LLM 可能产生:


Validation Techniques


👉 面试回答

Production LLM systems 不能盲目信任 model output。

系统应该验证 structure, 检查 business rules, 验证 tool results, 并在 output invalid 时 retry 或 fallback。


1️⃣2️⃣ Evaluation


Offline Evaluation

使用 test datasets。

Metrics:


Online Evaluation

使用 production signals。

Metrics:


👉 面试回答

LLM systems 需要 continuous evaluation。

Offline evaluation 在 launch 前测试 prompts 和 models。

Online evaluation 在 launch 后监控真实用户结果、 cost、latency、safety 和 task success。


1️⃣3️⃣ Safety and Guardrails


Safety Risks


Guardrails


👉 面试回答

Safety 是 LLM system design 的核心部分。

系统应该防范 hallucination、prompt injection、 data leakage、unsafe tool calls 和 unauthorized actions。

Guardrails 应该同时应用在 model input 和 model output 周围。


1️⃣4️⃣ Latency and Cost


Main Cost Drivers


Optimization Strategies


👉 面试回答

LLM systems 必须优化 latency 和 cost。

Long prompts、大模型、tool calls 和 multi-step agents 都会增加 cost 和 latency。

好的系统会把简单任务路由到更便宜的模型, 把 expensive reasoning models 留给更难任务。


1️⃣5️⃣ Observability


What to Log


Why Important?


👉 面试回答

Observability 对 LLM systems 很关键。

我会记录 prompt version、model version、 retrieval results、tool calls、latency、 token usage、cost、validation errors 和 user feedback。

这样系统才可以 debug 和持续改进。


1️⃣6️⃣ Common Architectures


Simple Chatbot

User input
→ Prompt
→ LLM
→ Response

适合简单 Q&A。


RAG Chatbot

User input
→ Retriever
→ Documents
→ Prompt
→ LLM
→ Answer with sources

适合 enterprise knowledge。


Tool-using Assistant

User input
→ LLM
→ Tool call
→ Tool result
→ LLM
→ Final answer

适合 workflows。


Agentic System

Goal
→ Planner
→ Tools
→ Memory
→ Reflection / validation
→ Final result

适合 multi-step tasks。


1️⃣7️⃣ End-to-End Flow


RAG Flow

User asks question
→ System rewrites query if needed
→ Retrieve relevant documents
→ Rank and filter context
→ Build prompt
→ LLM generates answer
→ Validate answer
→ Return with citations

Tool Flow

User asks task
→ LLM identifies needed tool
→ System validates permission
→ Tool executes
→ Result returned to LLM
→ LLM summarizes result

Agent Flow

User gives goal
→ Agent plans steps
→ Agent calls tools
→ Agent observes results
→ Agent revises plan
→ Agent returns final answer

🧠 Staff-Level Answer Final


👉 面试回答完整版本

在设计 LLM systems 时, 我不会只考虑 model 本身。

我会考虑围绕 model 构建的完整 application architecture。

LLM 负责 language understanding 和 generation, 但系统必须提供 context、检索知识、调用 tools、 验证 outputs、执行 safety、监控质量, 并控制 cost 和 latency。

一个基础 LLM system 会接收 user input, 构建 prompt, 调用 model, 然后返回 response。

更高级的系统会使用 RAG, 也就是检索相关 documents, 把它们加入 prompt, 让 model 基于外部知识回答。

对于 actions, 系统可以使用 tool calling, 让 model 决定调用哪个 tool, 由 application 安全执行 tool。

对于 multi-step tasks, 可以构建 agents, 让它们 plan、call tools、observe results 并迭代执行。

但是 LLM systems 需要强 guardrails。 系统必须处理 hallucination、prompt injection、 data leakage、invalid outputs、unsafe actions 和 privacy risks。

Production systems 应该验证 model outputs, 记录 prompt 和 model versions, 做 offline 和 online evaluation, 并监控 latency、cost、safety 和 user satisfaction。

核心权衡包括 quality、latency、cost、 reliability 和 safety。

最终目标是把 LLM 作为 reasoning 和 language component, 同时用外围架构提供 grounding、control、 safety 和 reliability。


⭐ Final Insight

LLM System 的核心不是“调用一个模型”, 而是把 model、prompt、context、RAG、tools、memory、evaluation、safety 组合成一个可控、可靠、可评估的应用系统。

Implement