
🎯 Intro to LLM Systems

1️⃣ Core Framework

When discussing LLM systems, I frame them in terms of:

Model layer
Prompt and context layer
Retrieval / RAG layer
Tool calling layer
Memory and personalization
Evaluation and monitoring
Safety and guardrails
Trade-offs: quality vs latency vs cost

2️⃣ What Is an LLM System?

An LLM system is not just a model.

It is a full application architecture around a large language model.

User Input
→ Prompt Builder
→ Context Retrieval
→ LLM
→ Tool Calls
→ Response Parser
→ Safety Checks
→ Final Answer

👉 Interview Answer

An LLM system is an application built around a large language model.

The model generates language, but the system around it provides context, retrieves knowledge, calls tools, enforces safety, monitors quality, and controls cost and latency.

3️⃣ Core Components

Model

The LLM is responsible for:

Understanding user input
Generating text
Reasoning over context
Following instructions
Producing structured output

Prompt

The prompt defines:

Task
Role
Instructions
Output format
Constraints
Examples

Context

Context is information given to the model at runtime.

Examples:

User question
Conversation history
Retrieved documents
Tool results
User profile
System instructions

👉 Interview Answer

The LLM itself is only one component.

A production LLM system also needs prompt design, context management, retrieval, tool integration, output validation, monitoring, and safety controls.

4️⃣ Prompt Engineering

Prompt Structure

System instruction
→ Developer instruction
→ User input
→ Retrieved context
→ Output format
→ Constraints

A Good Prompt Should Define

What the model should do
What information it can use
What output format is expected
What it should avoid
How to handle uncertainty

👉 Interview Answer

Prompt engineering is about controlling model behavior.

A good prompt clearly defines the task, constraints, context, and expected output format.

For production systems, prompts should be versioned, tested, and monitored like code.
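A minimal sketch of a versioned prompt template that follows the structure above. The class name, the `version` field, and the bracketed section labels are my own illustrative choices, not a standard format; real chat APIs usually take system and user messages as separate fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptTemplate:
    version: str                      # hypothetical tag, logged with every request
    system: str                       # system instruction
    output_format: str                # what shape the answer must take
    constraints: list = field(default_factory=list)

    def render(self, user_input: str, context_chunks: list) -> str:
        """Assemble the prompt sections in a fixed, testable order."""
        context = "\n".join(f"- {c}" for c in context_chunks) or "(none)"
        rules = "\n".join(f"- {c}" for c in self.constraints) or "(none)"
        return (
            f"[system]\n{self.system}\n\n"
            f"[user]\n{user_input}\n\n"
            f"[context]\n{context}\n\n"
            f"[output format]\n{self.output_format}\n\n"
            f"[constraints]\n{rules}\n"
        )


template = PromptTemplate(
    version="support-v1",
    system="You are a support assistant.",
    output_format="A short plain-text answer.",
    constraints=["Use only the provided context.", "Say 'I don't know' if unsure."],
)
prompt = template.render("What is the refund policy?", ["Refunds within 30 days."])
```

Because the template is plain data with a version tag, it can be unit tested and diffed like any other code.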

5️⃣ Context Window

What Is the Context Window?

The context window is the maximum amount of text, measured in tokens, that the model can see at once.

It includes:

system instructions
conversation history
retrieved documents
tool results
user input

Challenge

The model cannot hold unlimited information in a single request.

So we need:

Summarization
Retrieval
Truncation
Memory
Context ranking

👉 Interview Answer

The context window limits how much information the model can consider at once.

Since we cannot put everything into the prompt, the system must decide what context is most relevant, retrieve it, rank it, and compress or summarize it when needed.
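A rough sketch of ranking and truncating context under a token budget. Two assumptions here: relevance scores are already attached to each chunk, and `count_tokens` uses a whitespace word count as a stand-in for the model's real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace word count approximates the real token count.
    return len(text.split())


def pack_context(chunks, budget: int):
    """Greedily keep the highest-scoring chunks that still fit in `budget` tokens.

    `chunks` is a list of (relevance_score, text) pairs.
    """
    selected = []
    used = 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected


chunks = [
    (0.9, "refunds allowed within thirty days"),
    (0.5, "shipping takes five to seven business days"),
    (0.8, "receipt required"),
]
packed = pack_context(chunks, budget=7)
```

With a budget of 7, the two highest-scoring chunks fit (5 + 2 tokens) and the shipping chunk is dropped.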

6️⃣ RAG: Retrieval-Augmented Generation

What Is RAG?

RAG means:

Retrieve relevant knowledge
→ Add it to prompt
→ Let LLM generate answer

RAG Flow

User question
→ Convert question to embedding
→ Search vector database
→ Retrieve relevant chunks
→ Add chunks to prompt
→ LLM answers with context

Why RAG?

LLMs may not know:

Private company data
Recent information
Domain-specific documents
User-specific content

👉 Interview Answer

RAG helps ground LLM responses in external knowledge.

Instead of relying only on model memory, the system retrieves relevant documents at runtime and provides them as context to the model.

This improves factuality and allows the LLM to answer questions about private or updated data.

7️⃣ Embeddings and Vector Search

Embedding

An embedding converts text into a vector.

"refund policy" → [0.12, -0.44, 0.89, ...]

Similarity Search

Similar meanings have similar vectors.

query embedding
→ find nearest document embeddings

Used For

Document retrieval
Semantic search
Recommendation
Deduplication
Clustering

👉 Interview Answer

Embeddings allow semantic search.

The system converts documents and user queries into vectors, then finds the most similar document chunks using vector search.

This is the foundation of many RAG systems.
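A toy sketch of nearest-neighbor search by cosine similarity. The two-dimensional vectors are hand-written for illustration; in a real system they would come from an embedding model, and a vector database would replace the linear scan.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k: int = 2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical document embeddings (hand-written, not from a real model).
index = {
    "refund-policy": [1.0, 0.0],
    "shipping-times": [0.0, 1.0],
    "returns-faq": [0.9, 0.1],
}
nearest = top_k([1.0, 0.05], index, k=2)
```

The query vector points almost along the first axis, so the two refund-related documents rank ahead of the shipping one.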

8️⃣ Tool Calling

Why Tool Calling?

LLMs cannot reliably do everything internally.

Tools can provide:

Database lookup
Search
Calculator
Calendar
Email
Payment API
Code execution
Internal service calls

Tool Calling Flow

User asks question
→ LLM decides tool is needed
→ System calls tool
→ Tool returns result
→ LLM uses result to answer

👉 Interview Answer

Tool calling allows an LLM to interact with external systems.

The model decides what tool to call and with what arguments, while the application executes the tool safely and returns the result back to the model.

This turns the LLM from a text generator into an action-capable system.
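A small sketch of the application side of tool calling. The JSON call shape (`name` plus `arguments`) and the `get_order_status` tool are hypothetical; each model provider defines its own format. The point is that the application, not the model, decides what actually runs.

```python
import json


def get_order_status(order_id: str) -> dict:
    # Hypothetical tool; a real one would query an internal service.
    return {"order_id": order_id, "status": "shipped"}


# Allow-list: only tools registered here can ever be executed.
TOOLS = {"get_order_status": get_order_status}


def execute_tool_call(raw_call: str) -> dict:
    """Run a model-proposed call of the form {"name": ..., "arguments": {...}}."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"error": "tool call is not valid JSON"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}
    name = call.get("name")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**call.get("arguments", {}))}
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected arguments.
        return {"error": f"bad arguments: {exc}"}


ok = execute_tool_call('{"name": "get_order_status", "arguments": {"order_id": "A1"}}')
```

Errors are returned as data rather than raised, so they can be passed back to the model for a corrected attempt.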

9️⃣ Agents

What Is an Agent?

An agent is an LLM system that can:

Plan steps
Use tools
Observe results
Decide next action
Iterate until task is done

Agent Loop

Goal
→ Plan
→ Tool call
→ Observe result
→ Decide next step
→ Final answer

When Agents Are Useful

Research
Coding
Workflow automation
Data analysis
Multi-step troubleshooting
Operations assistants

👉 Interview Answer

An agent is an LLM system that can take multiple steps toward a goal.

It can plan, call tools, observe results, and continue until it reaches an answer or completes a task.

The challenge is controlling reliability, cost, latency, and safety.
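A bare-bones sketch of the agent loop with a hard step limit, which is one simple way to bound cost and latency. The `decide` callable stands in for the LLM, and the action dictionary shape is an assumption made for this example.

```python
from typing import Callable


def run_agent(goal: str, decide: Callable[[str, list], dict], tools: dict, max_steps: int = 5) -> dict:
    """Loop: decide -> call tool -> observe, until a final answer or the step limit."""
    observations = []
    for step in range(1, max_steps + 1):
        action = decide(goal, observations)
        if action["type"] == "final":
            return {"answer": action["answer"], "steps": step}
        tool = tools[action["tool"]]
        observations.append(tool(**action["args"]))
    return {"answer": None, "steps": max_steps, "error": "step limit reached"}


def scripted_decide(goal: str, observations: list) -> dict:
    # Stand-in for the model: call the tool once, then answer from the result.
    if not observations:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "answer": f"The sum is {observations[-1]}"}


tools = {"add": lambda a, b: a + b}
result = run_agent("add 2 and 3", scripted_decide, tools)
```

If the decision function never returns a final action, the loop stops at `max_steps` and reports an error instead of running indefinitely.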

🔟 Memory

Types of Memory

Short-term Memory

Conversation history.

Long-term Memory

Persistent user or domain information.

Working Memory

Temporary state during task execution.

Memory Challenges

What should be saved?
What should be forgotten?
How to avoid privacy risks?
How to retrieve relevant memory?
How to prevent stale memory?

👉 Interview Answer

Memory helps LLM systems personalize and continue work over time.

Short-term memory comes from conversation history.

Long-term memory stores persistent facts or preferences.

But memory must be managed carefully because it creates privacy, accuracy, and stale-information risks.
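A simplified sketch of short-term and long-term memory. The sliding window size and the time-to-live expiry are illustrative policies I chose for the example; real systems often add summarization, per-user storage, and deletion on request.

```python
import time
from collections import deque
from typing import Optional


class Memory:
    def __init__(self, max_turns: int = 3, ttl_seconds: float = 3600.0):
        self.short_term = deque(maxlen=max_turns)  # recent conversation turns only
        self.long_term = {}                        # key -> (value, stored_at)
        self.ttl_seconds = ttl_seconds

    def add_turn(self, role: str, text: str) -> None:
        # The oldest turn is dropped automatically once the window is full.
        self.short_term.append((role, text))

    def remember(self, key: str, value: str, now: Optional[float] = None) -> None:
        self.long_term[key] = (value, time.time() if now is None else now)

    def recall(self, key: str, now: Optional[float] = None) -> Optional[str]:
        entry = self.long_term.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        current = time.time() if now is None else now
        if current - stored_at > self.ttl_seconds:
            del self.long_term[key]  # forget stale facts instead of serving them
            return None
        return value


memory = Memory(max_turns=2, ttl_seconds=10)
for role, text in [("user", "hi"), ("assistant", "hello"), ("user", "I prefer Python")]:
    memory.add_turn(role, text)
memory.remember("preferred_language", "Python", now=100.0)
```

The explicit `now` parameter exists only to make expiry easy to test.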

1️⃣1️⃣ Output Validation

Why Needed?

LLMs can produce:

Wrong format
Invalid JSON
Unsupported action
Hallucinated facts
Unsafe content
Incomplete answers

Validation Techniques

JSON schema validation
Type checking
Business rule validation
Citation checking
Tool result verification
Retry with correction prompt

👉 Interview Answer

Production LLM systems should not blindly trust model output.

The system should validate structure, check business rules, verify tool results, and retry or fall back when output is invalid.
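A sketch of validate, retry, then fall back. The expected fields (`intent`, `confidence`), the allowed intents, and the fallback value are hypothetical; `call_model` stands in for a real model client. A production system might use a full JSON Schema validator instead of these hand-written checks.

```python
import json
from typing import Callable, Optional

REQUIRED_FIELDS = {"intent": str, "confidence": float}
ALLOWED_INTENTS = {"refund", "shipping", "other"}  # business rule


def validate(raw: str) -> Optional[str]:
    """Return an error message, or None if the output is acceptable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "expected a JSON object"
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            return f"field '{name}' must be {expected.__name__}"
    if data["intent"] not in ALLOWED_INTENTS:
        return f"unsupported intent: {data['intent']}"
    return None


def generate_validated(call_model: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        error = validate(raw)
        if error is None:
            return json.loads(raw)
        # Correction prompt: tell the model what was wrong with its last output.
        attempt_prompt = f"{prompt}\n\nYour previous output was rejected ({error}). Return valid JSON only."
    return {"intent": "other", "confidence": 0.0, "fallback": True}


replies = iter(["not json", '{"intent": "refund", "confidence": 0.9}'])
parsed = generate_validated(lambda p: next(replies), "Classify the message.")
```

Here the fake model fails once and succeeds on the retry; a model that never produces valid output ends in the explicit fallback.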

1️⃣2️⃣ Evaluation

Offline Evaluation

Use test datasets.

Metrics:

Accuracy
Factuality
Relevance
Format correctness
Safety
Tool-call correctness
Citation quality

Online Evaluation

Use production signals.

Metrics:

User satisfaction
Task completion rate
Escalation rate
Latency
Cost
Error rate
Human review score

👉 Interview Answer

LLM systems need continuous evaluation.

Offline evaluation tests prompts and models before launch.

Online evaluation monitors real user outcomes, cost, latency, safety, and task success after launch.
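A tiny offline evaluation harness, as a sketch. It scores by case-insensitive exact match, which is only one of the metrics listed above; factuality or relevance would need a rubric, human review, or a grader model. The dataset and `fake_system` are made up for illustration.

```python
def run_offline_eval(system, dataset) -> dict:
    """Score `system` (a callable str -> str) by exact match on each test case."""
    failures = []
    for case in dataset:
        actual = system(case["input"])
        if actual.strip().lower() != case["expected"].strip().lower():
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    total = len(dataset)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def fake_system(question: str) -> str:
    # Stand-in for prompt + model under test.
    return {"2+2": "4"}.get(question, "unknown")


report = run_offline_eval(fake_system, dataset)
```

Running the same dataset against each new prompt or model version makes regressions visible before launch.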

1️⃣3️⃣ Safety and Guardrails

Safety Risks

Hallucination
Prompt injection
Data leakage
Unsafe actions
Toxic content
Unauthorized tool use
Over-confident answers

Guardrails

Input filtering
Output filtering
Tool permission checks
Retrieval source validation
PII redaction
Human approval for risky actions
Refusal policies
Audit logs

👉 Interview Answer

Safety is a core part of LLM system design.

The system should protect against hallucination, prompt injection, data leakage, unsafe tool calls, and unauthorized actions.

Guardrails should be applied around both model input and model output.
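Two guardrails sketched in code: PII redaction on text and a permission check before tool execution. The roles, tool names, and approval list are hypothetical, and the email regex is deliberately simple; real PII detection covers far more than email addresses.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical policy tables.
ROLE_PERMISSIONS = {
    "viewer": {"search"},
    "agent": {"search", "issue_refund"},
}
REQUIRES_APPROVAL = {"issue_refund"}  # risky actions need a human in the loop


def redact_pii(text: str) -> str:
    """Replace email addresses before text is logged or sent to the model."""
    return EMAIL.sub("[EMAIL]", text)


def authorize(role: str, tool: str, human_approved: bool = False) -> str:
    """Return 'allow', 'deny', or 'needs_approval' for a proposed tool call."""
    if tool not in ROLE_PERMISSIONS.get(role, set()):
        return "deny"
    if tool in REQUIRES_APPROVAL and not human_approved:
        return "needs_approval"
    return "allow"


clean = redact_pii("Contact jane.doe@example.com now")
```

The check runs in the application, so a prompt-injected model request for an unauthorized tool is still denied.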

1️⃣4️⃣ Latency and Cost

Main Cost Drivers

Model size
Token count
Context length
Number of tool calls
Number of agent steps
Retrieval volume
Retry count

Optimization Strategies

Use smaller models for simple tasks
Compress context
Cache responses
Cache retrieval results
Limit agent steps
Stream responses
Use batch processing where possible

👉 Interview Answer

LLM systems must optimize both latency and cost.

Long prompts, large models, tool calls, and multi-step agents increase cost and latency.

A good system routes simple tasks to cheaper models and reserves expensive reasoning models for harder tasks.
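A back-of-the-envelope sketch of cost estimation and model routing. The model tiers, per-token prices, and task categories are invented for the example; actual prices vary by provider and change over time.

```python
# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES = {
    "small": (0.0005, 0.0015),
    "large": (0.01, 0.03),
}

SIMPLE_TASKS = {"classify", "extract"}  # assumed cheap-to-handle task kinds


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, from token counts and the price table."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


def route(task_kind: str) -> str:
    """Send simple tasks to the cheaper model, everything else to the larger one."""
    return "small" if task_kind in SIMPLE_TASKS else "large"


large_cost = estimate_cost("large", input_tokens=2000, output_tokens=500)
small_cost = estimate_cost("small", input_tokens=2000, output_tokens=500)
```

Under these made-up prices the same request costs 0.035 on the large tier and far less on the small one, which is why routing matters at volume.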

1️⃣5️⃣ Observability

What to Log

Prompt version
Model version
Input length
Output length
Retrieved documents
Tool calls
Latency
Cost
Error type
User feedback
Safety flags

Why Important?

Debug bad answers
Compare model versions
Detect regressions
Monitor cost
Improve prompts
Evaluate tool performance

👉 Interview Answer

Observability is critical for LLM systems.

I would log prompt version, model version, retrieval results, tool calls, latency, token usage, cost, validation errors, and user feedback.

This makes the system debuggable and improvable.
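A sketch of one structured log record per model call. The field names mirror the list above, but the exact schema and the example values are my own; a real system would send this to its logging or tracing backend and redact sensitive content first.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LLMCallLog:
    prompt_version: str
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    retrieved_doc_ids: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    error_type: Optional[str] = None
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize as one JSON line, with stable key order for easy diffing."""
        return json.dumps(asdict(self), sort_keys=True)


record = LLMCallLog(
    prompt_version="support-v3",       # hypothetical version labels
    model_version="example-model-1",
    input_tokens=1200,
    output_tokens=150,
    latency_ms=840.5,
    retrieved_doc_ids=["doc-7"],
)
line = record.to_json()
```

Logging the prompt and model versions on every call is what makes it possible to compare versions and trace a bad answer back to its inputs.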

1️⃣6️⃣ Common Architectures

Simple Chatbot

User input
→ Prompt
→ LLM
→ Response

Good for simple Q&A.

RAG Chatbot

User input
→ Retriever
→ Documents
→ Prompt
→ LLM
→ Answer with sources

Good for enterprise knowledge.

Tool-using Assistant

User input
→ LLM
→ Tool call
→ Tool result
→ LLM
→ Final answer

Good for workflows.

Agentic System

Goal
→ Planner
→ Tools
→ Memory
→ Reflection / validation
→ Final result

Good for multi-step tasks.

1️⃣7️⃣ End-to-End Flow

RAG Flow

User asks question
→ System rewrites query if needed
→ Retrieve relevant documents
→ Rank and filter context
→ Build prompt
→ LLM generates answer
→ Validate answer
→ Return with citations

Tool Flow

User asks task
→ LLM identifies needed tool
→ System validates permission
→ Tool executes
→ Result returned to LLM
→ LLM summarizes result

Agent Flow

User gives goal
→ Agent plans steps
→ Agent calls tools
→ Agent observes results
→ Agent revises plan
→ Agent returns final answer

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

When designing LLM systems, I do not think of the model alone.

I think of the full application architecture around the model.

The LLM is responsible for language understanding and generation, but the system must provide context, retrieve knowledge, call tools, validate outputs, enforce safety, monitor quality, and control cost and latency.

A basic LLM system takes user input, builds a prompt, calls the model, and returns a response.

More advanced systems use RAG, where the system retrieves relevant documents and adds them to the prompt so the model can answer using external knowledge.

For actions, the system can use tool calling, where the model decides which tool to use and the application executes that tool safely.

For multi-step tasks, we can build agents that plan, call tools, observe results, and iterate.

However, LLM systems need strong guardrails. The system must handle hallucination, prompt injection, data leakage, invalid outputs, unsafe actions, and privacy risks.

Production systems should validate model outputs, log prompt and model versions, evaluate quality offline and online, and monitor latency, cost, safety, and user satisfaction.

The key trade-offs are quality, latency, cost, reliability, and safety.

Ultimately, the goal is to build a system that uses the LLM as a reasoning and language component, while the surrounding architecture provides grounding, control, safety, and reliability.

⭐ Final Insight

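The core of an LLM system is not "calling a model"; it is combining the model, prompt, context, RAG, tools, memory, evaluation, and safety into an application that is controllable, reliable, and evaluable.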

Chinese Section

🎯 Intro to LLM Systems

1️⃣ Core Framework

When discussing LLM systems, I usually analyze them from the following angles:

Model layer
Prompt and context layer
Retrieval / RAG layer
Tool calling layer
Memory and personalization
Evaluation and monitoring
Safety and guardrails
Core trade-offs: quality vs latency vs cost

2️⃣ What Is an LLM System?

An LLM system is not just a model.

It is a full application architecture built around a large language model.

User Input
→ Prompt Builder
→ Context Retrieval
→ LLM
→ Tool Calls
→ Response Parser
→ Safety Checks
→ Final Answer

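👉 Interview Answer

An LLM system is an application built around a large language model.

The model generates language, but the system around it provides context, retrieves knowledge, calls tools, enforces safety, monitors quality, and controls cost and latency.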

3️⃣ Core Components

Model

The LLM is responsible for:

Understanding user input
Generating text
Reasoning over context
Following instructions
Producing structured output

Prompt

The prompt defines:

Task
Role
Instructions
Output format
Constraints
Examples

Context

Context is information given to the model at runtime.

Examples:

User question
Conversation history
Retrieved documents
Tool results
User profile
System instructions

👉 Interview Answer

The LLM itself is only one component.

A production LLM system also needs prompt design, context management, retrieval, tool integration, output validation, monitoring, and safety controls.

4️⃣ Prompt Engineering

Prompt Structure

System instruction
→ Developer instruction
→ User input
→ Retrieved context
→ Output format
→ Constraints

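A Good Prompt Should Define

What the model should do
What information it can use
What output format is expected
What it should avoid
How to handle uncertainty

👉 Interview Answer

Prompt engineering is a way to control model behavior.

A good prompt clearly defines the task, constraints, context, and expected output format.

For production systems, prompts should be versioned, tested, and monitored like code.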

5️⃣ Context Window

What Is the Context Window?

The context window is the amount of text the model can see in a single request.

It includes:

system instructions
conversation history
retrieved documents
tool results
user input

Challenge

The model cannot hold unlimited information in a single request.

So we need:

Summarization
Retrieval
Truncation
Memory
Context ranking

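👉 Interview Answer

The context window limits how much information the model can consider at once.

Since we cannot fit everything into the prompt, the system must decide which context is most relevant, retrieve it, rank it, and compress or summarize it when needed.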

6️⃣ RAG: Retrieval-Augmented Generation

What Is RAG?

RAG means:

Retrieve relevant knowledge
→ Add it to prompt
→ Let LLM generate answer

RAG Flow

User question
→ Convert question to embedding
→ Search vector database
→ Retrieve relevant chunks
→ Add chunks to prompt
→ LLM answers with context

Why RAG?

LLMs may not know:

Private company data
Recent information
Domain-specific documents
User-specific content

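👉 Interview Answer

RAG grounds LLM responses in external knowledge.

Instead of relying only on model memory, the system retrieves relevant documents at runtime and provides them to the model as context.

This improves factuality and lets the LLM answer questions about private or updated data.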

7️⃣ Embeddings and Vector Search

Embedding

An embedding converts text into a vector.

"refund policy" → [0.12, -0.44, 0.89, ...]

Similarity Search

Texts with similar meanings have similar vectors.

query embedding
→ find nearest document embeddings

Used For

Document retrieval
Semantic search
Recommendation
Deduplication
Clustering

👉 Interview Answer

Embeddings make semantic search possible.

The system converts documents and user queries into vectors, then uses vector search to find the most similar document chunks.

This is the foundation of many RAG systems.

8️⃣ Tool Calling

Why Tool Calling?

LLMs cannot reliably do everything internally.

Tools can provide:

Database lookup
Search
Calculator
Calendar
Email
Payment API
Code execution
Internal service calls

Tool Calling Flow

User asks question
→ LLM decides tool is needed
→ System calls tool
→ Tool returns result
→ LLM uses result to answer

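👉 Interview Answer

Tool calling lets an LLM interact with external systems.

The model decides which tool to call and with what arguments, while the application executes the tool safely and returns the result to the model.

This makes the LLM more than a text generator: it becomes a system that can take actions.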

9️⃣ Agents

What Is an Agent?

An agent is an LLM system that can:

Plan steps
Use tools
Observe results
Decide next action
Iterate until task is done

Agent Loop

Goal
→ Plan
→ Tool call
→ Observe result
→ Decide next step
→ Final answer

When Agents Are Useful

Research
Coding
Workflow automation
Data analysis
Multi-step troubleshooting
Operations assistants

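👉 Interview Answer

An agent is an LLM system that can take multiple steps toward a goal.

It can plan, call tools, observe results, and continue until it reaches an answer or completes the task.

The challenge is controlling reliability, cost, latency, and safety.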

🔟 Memory

Types of Memory

Short-term Memory

Conversation history.

Long-term Memory

Persistent user or domain information.

Working Memory

Temporary state during task execution.

Memory Challenges

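What should be saved?
What should be forgotten?
How to avoid privacy risks?
How to retrieve relevant memory?
How to prevent stale memory?

👉 Interview Answer

Memory helps LLM systems personalize and continue tasks across multiple interactions.

Short-term memory comes from conversation history.

Long-term memory stores persistent facts or preferences.

But memory must be managed carefully because it creates privacy, accuracy, and stale-information risks.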

1️⃣1️⃣ Output Validation

Why Needed?

LLMs can produce:

Wrong format
Invalid JSON
Unsupported action
Hallucinated facts
Unsafe content
Incomplete answers

Validation Techniques

JSON schema validation
Type checking
Business rule validation
Citation checking
Tool result verification
Retry with correction prompt

👉 Interview Answer

Production LLM systems should not blindly trust model output.

The system should validate structure, check business rules, verify tool results, and retry or fall back when output is invalid.

1️⃣2️⃣ Evaluation

Offline Evaluation

Use test datasets.

Metrics:

Accuracy
Factuality
Relevance
Format correctness
Safety
Tool-call correctness
Citation quality

Online Evaluation

Use production signals.

Metrics:

User satisfaction
Task completion rate
Escalation rate
Latency
Cost
Error rate
Human review score

👉 Interview Answer

LLM systems need continuous evaluation.

Offline evaluation tests prompts and models before launch.

Online evaluation monitors real user outcomes, cost, latency, safety, and task success after launch.

1️⃣3️⃣ Safety and Guardrails

Safety Risks

Hallucination
Prompt injection
Data leakage
Unsafe actions
Toxic content
Unauthorized tool use
Over-confident answers

Guardrails

Input filtering
Output filtering
Tool permission checks
Retrieval source validation
PII redaction
Human approval for risky actions
Refusal policies
Audit logs

👉 Interview Answer

Safety is a core part of LLM system design.

The system should protect against hallucination, prompt injection, data leakage, unsafe tool calls, and unauthorized actions.

Guardrails should be applied around both model input and model output.

1️⃣4️⃣ Latency and Cost

Main Cost Drivers

Model size
Token count
Context length
Number of tool calls
Number of agent steps
Retrieval volume
Retry count

Optimization Strategies

Use smaller models for simple tasks
Compress context
Cache responses
Cache retrieval results
Limit agent steps
Stream responses
Use batch processing where possible

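👉 Interview Answer

LLM systems must optimize both latency and cost.

Long prompts, large models, tool calls, and multi-step agents all increase cost and latency.

A good system routes simple tasks to cheaper models and reserves expensive reasoning models for harder tasks.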

1️⃣5️⃣ Observability

What to Log

Prompt version
Model version
Input length
Output length
Retrieved documents
Tool calls
Latency
Cost
Error type
User feedback
Safety flags

Why Important?

Debug bad answers
Compare model versions
Detect regressions
Monitor cost
Improve prompts
Evaluate tool performance

👉 Interview Answer

Observability is critical for LLM systems.

I would log prompt version, model version, retrieval results, tool calls, latency, token usage, cost, validation errors, and user feedback.

This is what makes the system debuggable and continuously improvable.

1️⃣6️⃣ Common Architectures

Simple Chatbot

User input
→ Prompt
→ LLM
→ Response

Good for simple Q&A.

RAG Chatbot

User input
→ Retriever
→ Documents
→ Prompt
→ LLM
→ Answer with sources

Good for enterprise knowledge.

Tool-using Assistant

User input
→ LLM
→ Tool call
→ Tool result
→ LLM
→ Final answer

Good for workflows.

Agentic System

Goal
→ Planner
→ Tools
→ Memory
→ Reflection / validation
→ Final result

Good for multi-step tasks.

1️⃣7️⃣ End-to-End Flow

RAG Flow

User asks question
→ System rewrites query if needed
→ Retrieve relevant documents
→ Rank and filter context
→ Build prompt
→ LLM generates answer
→ Validate answer
→ Return with citations

Tool Flow

User asks task
→ LLM identifies needed tool
→ System validates permission
→ Tool executes
→ Result returned to LLM
→ LLM summarizes result

Agent Flow

User gives goal
→ Agent plans steps
→ Agent calls tools
→ Agent observes results
→ Agent revises plan
→ Agent returns final answer

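🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

When designing LLM systems, I do not think of the model alone.

I think of the full application architecture built around the model.

The LLM is responsible for language understanding and generation, but the system must provide context, retrieve knowledge, call tools, validate outputs, enforce safety, monitor quality, and control cost and latency.

A basic LLM system takes user input, builds a prompt, calls the model, and returns a response.

More advanced systems use RAG: they retrieve relevant documents and add them to the prompt so the model can answer using external knowledge.

For actions, the system can use tool calling, where the model decides which tool to call and the application executes that tool safely.

For multi-step tasks, we can build agents that plan, call tools, observe results, and iterate.

However, LLM systems need strong guardrails. The system must handle hallucination, prompt injection, data leakage, invalid outputs, unsafe actions, and privacy risks.

Production systems should validate model outputs, log prompt and model versions, run offline and online evaluation, and monitor latency, cost, safety, and user satisfaction.

The key trade-offs are quality, latency, cost, reliability, and safety.

The ultimate goal is to use the LLM as a reasoning and language component, while the surrounding architecture provides grounding, control, safety, and reliability.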

⭐ Final Insight

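The core of an LLM system is not "calling a model"; it is combining the model, prompt, context, RAG, tools, memory, evaluation, and safety into an application that is controllable, reliable, and evaluable.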