🎯 Intro to LLM Systems
1️⃣ Core Framework
When discussing LLM Systems, I frame it as:
- Model layer
- Prompt and context layer
- Retrieval / RAG layer
- Tool calling layer
- Memory and personalization
- Evaluation and monitoring
- Safety and guardrails
- Trade-offs: quality vs latency vs cost
2️⃣ What Is an LLM System?
An LLM system is not just a model.
It is a full application architecture around a large language model.
User Input
→ Prompt Builder
→ Context Retrieval
→ LLM
→ Tool Calls
→ Response Parser
→ Safety Checks
→ Final Answer
👉 Interview Answer
An LLM system is an application built around a large language model.
The model generates language, but the system around it provides context, retrieves knowledge, calls tools, enforces safety, monitors quality, and controls cost and latency.
3️⃣ Core Components
Model
The LLM is responsible for:
- Understanding user input
- Generating text
- Reasoning over context
- Following instructions
- Producing structured output
Prompt
Prompt defines:
- Task
- Role
- Instructions
- Output format
- Constraints
- Examples
Context
Context is information given to the model at runtime.
Examples:
- User question
- Conversation history
- Retrieved documents
- Tool results
- User profile
- System instructions
👉 Interview Answer
The LLM itself is only one component.
A production LLM system also needs prompt design, context management, retrieval, tool integration, output validation, monitoring, and safety controls.
4️⃣ Prompt Engineering
Prompt Structure
System instruction
→ Developer instruction
→ User input
→ Retrieved context
→ Output format
→ Constraints
Good Prompt Should Define
- What the model should do
- What information it can use
- What output format is expected
- What it should avoid
- How to handle uncertainty
👉 Interview Answer
Prompt engineering is about controlling model behavior.
A good prompt clearly defines the task, constraints, context, and expected output format.
For production systems, prompts should be versioned, tested, and monitored like code.
5️⃣ Context Window
What Is Context Window?
The context window is the amount of text the model can see at once.
It includes:
system instructions
conversation history
retrieved documents
tool results
user input
Challenge
The model cannot remember unlimited information inside one request.
So we need:
- Summarization
- Retrieval
- Truncation
- Memory
- Context ranking
👉 Interview Answer
The context window limits how much information the model can consider at once.
Since we cannot put everything into the prompt, the system must decide what context is most relevant, retrieve it, rank it, and compress or summarize it when needed.
6️⃣ RAG: Retrieval-Augmented Generation
What Is RAG?
RAG means:
Retrieve relevant knowledge
→ Add it to prompt
→ Let LLM generate answer
RAG Flow
User question
→ Convert question to embedding
→ Search vector database
→ Retrieve relevant chunks
→ Add chunks to prompt
→ LLM answers with context
Why RAG?
LLMs may not know:
- Private company data
- Recent information
- Domain-specific documents
- User-specific content
👉 Interview Answer
RAG helps ground LLM responses in external knowledge.
Instead of relying only on model memory, the system retrieves relevant documents at runtime and provides them as context to the model.
This improves factuality and allows the LLM to answer questions about private or updated data.
7️⃣ Embeddings and Vector Search
Embedding
An embedding converts text into a vector.
"refund policy" → [0.12, -0.44, 0.89, ...]
Similarity Search
Similar meanings have similar vectors.
query embedding
→ find nearest document embeddings
Used For
- Document retrieval
- Semantic search
- Recommendation
- Deduplication
- Clustering
👉 Interview Answer
Embeddings allow semantic search.
The system converts documents and user queries into vectors, then finds the most similar document chunks using vector search.
This is the foundation of many RAG systems.
8️⃣ Tool Calling
Why Tool Calling?
LLMs cannot reliably do everything internally.
Tools can provide:
- Database lookup
- Search
- Calculator
- Calendar
- Payment API
- Code execution
- Internal service calls
Tool Calling Flow
User asks question
→ LLM decides tool is needed
→ System calls tool
→ Tool returns result
→ LLM uses result to answer
👉 Interview Answer
Tool calling allows an LLM to interact with external systems.
The model decides what tool to call and with what arguments, while the application executes the tool safely and returns the result back to the model.
This turns the LLM from a text generator into an action-capable system.
9️⃣ Agents
What Is an Agent?
An agent is an LLM system that can:
- Plan steps
- Use tools
- Observe results
- Decide next action
- Iterate until task is done
Agent Loop
Goal
→ Plan
→ Tool call
→ Observe result
→ Decide next step
→ Final answer
When Agents Are Useful
- Research
- Coding
- Workflow automation
- Data analysis
- Multi-step troubleshooting
- Operations assistants
👉 Interview Answer
An agent is an LLM system that can take multiple steps toward a goal.
It can plan, call tools, observe results, and continue until it reaches an answer or completes a task.
The challenge is controlling reliability, cost, latency, and safety.
🔟 Memory
Types of Memory
Short-term Memory
Conversation history.
Long-term Memory
Persistent user or domain information.
Working Memory
Temporary state during task execution.
Memory Challenges
- What should be saved?
- What should be forgotten?
- How to avoid privacy risks?
- How to retrieve relevant memory?
- How to prevent stale memory?
👉 Interview Answer
Memory helps LLM systems personalize and continue work over time.
Short-term memory comes from conversation history.
Long-term memory stores persistent facts or preferences.
But memory must be managed carefully because it creates privacy, accuracy, and stale-information risks.
1️⃣1️⃣ Output Validation
Why Needed?
LLMs can produce:
- Wrong format
- Invalid JSON
- Unsupported action
- Hallucinated facts
- Unsafe content
- Incomplete answers
Validation Techniques
- JSON schema validation
- Type checking
- Business rule validation
- Citation checking
- Tool result verification
- Retry with correction prompt
👉 Interview Answer
Production LLM systems should not blindly trust model output.
The system should validate structure, check business rules, verify tool results, and retry or fallback when output is invalid.
1️⃣2️⃣ Evaluation
Offline Evaluation
Use test datasets.
Metrics:
- Accuracy
- Factuality
- Relevance
- Format correctness
- Safety
- Tool-call correctness
- Citation quality
Online Evaluation
Use production signals.
Metrics:
- User satisfaction
- Task completion rate
- Escalation rate
- Latency
- Cost
- Error rate
- Human review score
👉 Interview Answer
LLM systems need continuous evaluation.
Offline evaluation tests prompts and models before launch.
Online evaluation monitors real user outcomes, cost, latency, safety, and task success after launch.
1️⃣3️⃣ Safety and Guardrails
Safety Risks
- Hallucination
- Prompt injection
- Data leakage
- Unsafe actions
- Toxic content
- Unauthorized tool use
- Over-confident answers
Guardrails
- Input filtering
- Output filtering
- Tool permission checks
- Retrieval source validation
- PII redaction
- Human approval for risky actions
- Refusal policies
- Audit logs
👉 Interview Answer
Safety is a core part of LLM system design.
The system should protect against hallucination, prompt injection, data leakage, unsafe tool calls, and unauthorized actions.
Guardrails should be applied around both model input and model output.
1️⃣4️⃣ Latency and Cost
Main Cost Drivers
- Model size
- Token count
- Context length
- Number of tool calls
- Number of agent steps
- Retrieval volume
- Retry count
Optimization Strategies
- Use smaller models for simple tasks
- Compress context
- Cache responses
- Cache retrieval results
- Limit agent steps
- Stream responses
- Use batch processing where possible
👉 Interview Answer
LLM systems must optimize both latency and cost.
Long prompts, large models, tool calls, and multi-step agents increase cost and latency.
A good system routes simple tasks to cheaper models and reserves expensive reasoning models for harder tasks.
1️⃣5️⃣ Observability
What to Log
- Prompt version
- Model version
- Input length
- Output length
- Retrieved documents
- Tool calls
- Latency
- Cost
- Error type
- User feedback
- Safety flags
Why Important?
- Debug bad answers
- Compare model versions
- Detect regressions
- Monitor cost
- Improve prompts
- Evaluate tool performance
👉 Interview Answer
Observability is critical for LLM systems.
I would log prompt version, model version, retrieval results, tool calls, latency, token usage, cost, validation errors, and user feedback.
This makes the system debuggable and improvable.
1️⃣6️⃣ Common Architectures
Simple Chatbot
User input
→ Prompt
→ LLM
→ Response
Good for simple Q&A.
RAG Chatbot
User input
→ Retriever
→ Documents
→ Prompt
→ LLM
→ Answer with sources
Good for enterprise knowledge.
Tool-using Assistant
User input
→ LLM
→ Tool call
→ Tool result
→ LLM
→ Final answer
Good for workflows.
Agentic System
Goal
→ Planner
→ Tools
→ Memory
→ Reflection / validation
→ Final result
Good for multi-step tasks.
1️⃣7️⃣ End-to-End Flow
RAG Flow
User asks question
→ System rewrites query if needed
→ Retrieve relevant documents
→ Rank and filter context
→ Build prompt
→ LLM generates answer
→ Validate answer
→ Return with citations
Tool Flow
User asks task
→ LLM identifies needed tool
→ System validates permission
→ Tool executes
→ Result returned to LLM
→ LLM summarizes result
Agent Flow
User gives goal
→ Agent plans steps
→ Agent calls tools
→ Agent observes results
→ Agent revises plan
→ Agent returns final answer
🧠 Staff-Level Answer Final
👉 Interview Answer Full Version
When designing LLM systems, I do not think of the model alone.
I think of the full application architecture around the model.
The LLM is responsible for language understanding and generation, but the system must provide context, retrieve knowledge, call tools, validate outputs, enforce safety, monitor quality, and control cost and latency.
A basic LLM system takes user input, builds a prompt, calls the model, and returns a response.
More advanced systems use RAG, where the system retrieves relevant documents and adds them to the prompt so the model can answer using external knowledge.
For actions, the system can use tool calling, where the model decides which tool to use and the application executes that tool safely.
For multi-step tasks, we can build agents that plan, call tools, observe results, and iterate.
However, LLM systems need strong guardrails. The system must handle hallucination, prompt injection, data leakage, invalid outputs, unsafe actions, and privacy risks.
Production systems should validate model outputs, log prompt and model versions, evaluate quality offline and online, and monitor latency, cost, safety, and user satisfaction.
The key trade-offs are quality, latency, cost, reliability, and safety.
Ultimately, the goal is to build a system that uses the LLM as a reasoning and language component, while the surrounding architecture provides grounding, control, safety, and reliability.
⭐ Final Insight
LLM System 的核心不是“调用一个模型”, 而是把 model、prompt、context、RAG、tools、memory、evaluation、safety 组合成一个可控、可靠、可评估的应用系统。
中文部分
🎯 Intro to LLM Systems
1️⃣ 核心框架
在讨论 LLM Systems 时,我通常从以下几个方面分析:
- Model layer
- Prompt and context layer
- Retrieval / RAG layer
- Tool calling layer
- Memory and personalization
- Evaluation and monitoring
- Safety and guardrails
- 核心权衡:quality vs latency vs cost
2️⃣ 什么是 LLM System?
LLM system 不只是一个模型。
它是围绕 large language model 构建的一整套应用架构。
User Input
→ Prompt Builder
→ Context Retrieval
→ LLM
→ Tool Calls
→ Response Parser
→ Safety Checks
→ Final Answer
👉 面试回答
LLM system 是围绕 large language model 构建的应用系统。
Model 负责生成语言, 但 model 周围的系统负责提供 context、 检索知识、调用 tools、执行 safety、 监控质量,并控制 cost 和 latency。
3️⃣ 核心组件
Model
LLM 负责:
- 理解用户输入
- 生成文本
- 基于 context 推理
- 遵循 instructions
- 生成 structured output
Prompt
Prompt 定义:
- Task
- Role
- Instructions
- Output format
- Constraints
- Examples
Context
Context 是 runtime 提供给模型的信息。
例如:
- User question
- Conversation history
- Retrieved documents
- Tool results
- User profile
- System instructions
👉 面试回答
LLM 本身只是一个组件。
Production LLM system 还需要 prompt design、 context management、retrieval、tool integration、 output validation、monitoring 和 safety controls。
4️⃣ Prompt Engineering
Prompt Structure
System instruction
→ Developer instruction
→ User input
→ Retrieved context
→ Output format
→ Constraints
Good Prompt Should Define
- 模型应该做什么
- 可以使用哪些信息
- 输出格式是什么
- 应该避免什么
- 不确定时如何处理
👉 面试回答
Prompt engineering 是控制 model behavior 的方法。
好的 prompt 会清楚定义 task、constraints、 context 和 expected output format。
对 production systems 来说, prompts 应该像代码一样 versioned、tested 和 monitored。
5️⃣ Context Window
什么是 Context Window?
Context window 是模型一次请求中能看到的文本量。
它包括:
system instructions
conversation history
retrieved documents
tool results
user input
Challenge
模型不能在一个 request 中记住无限信息。
所以需要:
- Summarization
- Retrieval
- Truncation
- Memory
- Context ranking
👉 面试回答
Context window 限制了模型一次能考虑多少信息。
因为不能把所有内容都塞进 prompt, 系统必须决定哪些 context 最相关, 检索它们、排序它们, 并在必要时压缩或总结。
6️⃣ RAG: Retrieval-Augmented Generation
什么是 RAG?
RAG 表示:
Retrieve relevant knowledge
→ Add it to prompt
→ Let LLM generate answer
RAG Flow
User question
→ Convert question to embedding
→ Search vector database
→ Retrieve relevant chunks
→ Add chunks to prompt
→ LLM answers with context
为什么需要 RAG?
LLM 可能不知道:
- Private company data
- Recent information
- Domain-specific documents
- User-specific content
👉 面试回答
RAG 可以让 LLM 的回答基于外部知识。
系统不只依赖 model memory, 而是在 runtime 检索相关 documents, 并把它们作为 context 提供给模型。
这样可以提升 factuality, 也让 LLM 能回答 private data 或 updated data 相关问题。
7️⃣ Embeddings and Vector Search
Embedding
Embedding 将文本转换成向量。
"refund policy" → [0.12, -0.44, 0.89, ...]
Similarity Search
语义相近的文本,向量也相近。
query embedding
→ find nearest document embeddings
Used For
- Document retrieval
- Semantic search
- Recommendation
- Deduplication
- Clustering
👉 面试回答
Embeddings 让 semantic search 成为可能。
系统将 documents 和 user queries 转换为 vectors, 然后用 vector search 找到最相似的 document chunks。
这是很多 RAG systems 的基础。
8️⃣ Tool Calling
为什么需要 Tool Calling?
LLM 不能可靠地在内部完成所有事情。
Tools 可以提供:
- Database lookup
- Search
- Calculator
- Calendar
- Payment API
- Code execution
- Internal service calls
Tool Calling Flow
User asks question
→ LLM decides tool is needed
→ System calls tool
→ Tool returns result
→ LLM uses result to answer
👉 面试回答
Tool calling 让 LLM 可以和外部系统交互。
Model 决定调用哪个 tool 以及传什么 arguments, application 安全地执行 tool, 然后把结果返回给 model。
这样 LLM 就不只是 text generator, 而是可以执行动作的系统。
9️⃣ Agents
什么是 Agent?
Agent 是一种可以执行以下能力的 LLM system:
- Plan steps
- Use tools
- Observe results
- Decide next action
- Iterate until task is done
Agent Loop
Goal
→ Plan
→ Tool call
→ Observe result
→ Decide next step
→ Final answer
When Agents Are Useful
- Research
- Coding
- Workflow automation
- Data analysis
- Multi-step troubleshooting
- Operations assistants
👉 面试回答
Agent 是可以朝着目标执行多步操作的 LLM system。
它可以 plan、call tools、observe results, 并继续执行直到得到答案或完成任务。
难点是控制 reliability、cost、latency 和 safety。
🔟 Memory
Types of Memory
Short-term Memory
Conversation history.
Long-term Memory
Persistent user or domain information.
Working Memory
任务执行过程中的临时状态。
Memory Challenges
- 什么应该被保存?
- 什么应该被忘记?
- 如何避免 privacy risks?
- 如何检索相关 memory?
- 如何防止 stale memory?
👉 面试回答
Memory 帮助 LLM systems 个性化, 并在多次交互中延续任务。
Short-term memory 来自 conversation history。
Long-term memory 存储持久 facts 或 preferences。
但 memory 必须谨慎管理, 因为它会带来 privacy、accuracy 和 stale-information risks。
1️⃣1️⃣ Output Validation
为什么需要?
LLM 可能产生:
- 错误格式
- Invalid JSON
- Unsupported action
- Hallucinated facts
- Unsafe content
- Incomplete answers
Validation Techniques
- JSON schema validation
- Type checking
- Business rule validation
- Citation checking
- Tool result verification
- Retry with correction prompt
👉 面试回答
Production LLM systems 不能盲目信任 model output。
系统应该验证 structure, 检查 business rules, 验证 tool results, 并在 output invalid 时 retry 或 fallback。
1️⃣2️⃣ Evaluation
Offline Evaluation
使用 test datasets。
Metrics:
- Accuracy
- Factuality
- Relevance
- Format correctness
- Safety
- Tool-call correctness
- Citation quality
Online Evaluation
使用 production signals。
Metrics:
- User satisfaction
- Task completion rate
- Escalation rate
- Latency
- Cost
- Error rate
- Human review score
👉 面试回答
LLM systems 需要 continuous evaluation。
Offline evaluation 在 launch 前测试 prompts 和 models。
Online evaluation 在 launch 后监控真实用户结果、 cost、latency、safety 和 task success。
1️⃣3️⃣ Safety and Guardrails
Safety Risks
- Hallucination
- Prompt injection
- Data leakage
- Unsafe actions
- Toxic content
- Unauthorized tool use
- Over-confident answers
Guardrails
- Input filtering
- Output filtering
- Tool permission checks
- Retrieval source validation
- PII redaction
- Human approval for risky actions
- Refusal policies
- Audit logs
👉 面试回答
Safety 是 LLM system design 的核心部分。
系统应该防范 hallucination、prompt injection、 data leakage、unsafe tool calls 和 unauthorized actions。
Guardrails 应该同时应用在 model input 和 model output 周围。
1️⃣4️⃣ Latency and Cost
Main Cost Drivers
- Model size
- Token count
- Context length
- Number of tool calls
- Number of agent steps
- Retrieval volume
- Retry count
Optimization Strategies
- 简单任务使用小模型
- Compress context
- Cache responses
- Cache retrieval results
- Limit agent steps
- Stream responses
- 可行时使用 batch processing
👉 面试回答
LLM systems 必须优化 latency 和 cost。
Long prompts、大模型、tool calls 和 multi-step agents 都会增加 cost 和 latency。
好的系统会把简单任务路由到更便宜的模型, 把 expensive reasoning models 留给更难任务。
1️⃣5️⃣ Observability
What to Log
- Prompt version
- Model version
- Input length
- Output length
- Retrieved documents
- Tool calls
- Latency
- Cost
- Error type
- User feedback
- Safety flags
Why Important?
- Debug bad answers
- Compare model versions
- Detect regressions
- Monitor cost
- Improve prompts
- Evaluate tool performance
👉 面试回答
Observability 对 LLM systems 很关键。
我会记录 prompt version、model version、 retrieval results、tool calls、latency、 token usage、cost、validation errors 和 user feedback。
这样系统才可以 debug 和持续改进。
1️⃣6️⃣ Common Architectures
Simple Chatbot
User input
→ Prompt
→ LLM
→ Response
适合简单 Q&A。
RAG Chatbot
User input
→ Retriever
→ Documents
→ Prompt
→ LLM
→ Answer with sources
适合 enterprise knowledge。
Tool-using Assistant
User input
→ LLM
→ Tool call
→ Tool result
→ LLM
→ Final answer
适合 workflows。
Agentic System
Goal
→ Planner
→ Tools
→ Memory
→ Reflection / validation
→ Final result
适合 multi-step tasks。
1️⃣7️⃣ End-to-End Flow
RAG Flow
User asks question
→ System rewrites query if needed
→ Retrieve relevant documents
→ Rank and filter context
→ Build prompt
→ LLM generates answer
→ Validate answer
→ Return with citations
Tool Flow
User asks task
→ LLM identifies needed tool
→ System validates permission
→ Tool executes
→ Result returned to LLM
→ LLM summarizes result
Agent Flow
User gives goal
→ Agent plans steps
→ Agent calls tools
→ Agent observes results
→ Agent revises plan
→ Agent returns final answer
🧠 Staff-Level Answer Final
👉 面试回答完整版本
在设计 LLM systems 时, 我不会只考虑 model 本身。
我会考虑围绕 model 构建的完整 application architecture。
LLM 负责 language understanding 和 generation, 但系统必须提供 context、检索知识、调用 tools、 验证 outputs、执行 safety、监控质量, 并控制 cost 和 latency。
一个基础 LLM system 会接收 user input, 构建 prompt, 调用 model, 然后返回 response。
更高级的系统会使用 RAG, 也就是检索相关 documents, 把它们加入 prompt, 让 model 基于外部知识回答。
对于 actions, 系统可以使用 tool calling, 让 model 决定调用哪个 tool, 由 application 安全执行 tool。
对于 multi-step tasks, 可以构建 agents, 让它们 plan、call tools、observe results 并迭代执行。
但是 LLM systems 需要强 guardrails。 系统必须处理 hallucination、prompt injection、 data leakage、invalid outputs、unsafe actions 和 privacy risks。
Production systems 应该验证 model outputs, 记录 prompt 和 model versions, 做 offline 和 online evaluation, 并监控 latency、cost、safety 和 user satisfaction。
核心权衡包括 quality、latency、cost、 reliability 和 safety。
最终目标是把 LLM 作为 reasoning 和 language component, 同时用外围架构提供 grounding、control、 safety 和 reliability。
⭐ Final Insight
LLM System 的核心不是“调用一个模型”, 而是把 model、prompt、context、RAG、tools、memory、evaluation、safety 组合成一个可控、可靠、可评估的应用系统。
Implement