
System Design Deep Dive - 04 Design an AI Recommendation System with LLMs

Post by ailswan May. 24, 2026


🎯 Design an AI Recommendation System with LLMs


1️⃣ Core Framework

When designing an AI Recommendation System with LLMs, I frame it as:

  1. Product requirements
  2. User and item modeling
  3. Candidate generation
  4. Ranking and personalization
  5. LLM-based reasoning layer
  6. Feedback loop
  7. Safety and fairness
  8. Trade-offs: relevance vs latency vs cost

2️⃣ Product Goal

An AI recommendation system suggests relevant items to users.

Examples:

  • Products
  • Videos
  • Jobs
  • Documents


Basic Flow

User Context
→ Candidate Retrieval
→ Ranking
→ LLM Reasoning / Explanation
→ Personalized Recommendations
→ Feedback Loop

👉 Interview Answer

An AI recommendation system uses user behavior, item metadata, embeddings, ranking models, and sometimes LLM reasoning to recommend relevant items.

The LLM is usually not the whole recommender.

It is often used for understanding intent, enriching metadata, explaining recommendations, or re-ranking small candidate sets.


3️⃣ Functional Requirements


Core Features

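The system should support:

  • Personalized recommendations
  • Similar-item recommendations
  • Cold-start handling
  • Ranking
  • Feedback collection
  • Filtering
  • Recommendation explanations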


Examples

Recommend products similar to this item.

Recommend videos based on watch history.

Recommend jobs based on resume and preferences.

Recommend documents based on current task.

👉 Interview Answer

Core requirements include personalized recommendations, similar-item recommendations, cold-start handling, ranking, feedback collection, filtering, and recommendation explanations.


4️⃣ Non-functional Requirements


Important System Qualities

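The system should optimize for:

  • Relevance
  • Latency
  • Scalability
  • Freshness
  • Diversity
  • Fairness
  • Explainability
  • Privacy
  • Cost efficiency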


Key Trade-off

More personalization and LLM reasoning
→ Better quality

But also
→ Higher latency and cost

👉 Interview Answer

Non-functional requirements include relevance, latency, scalability, freshness, diversity, fairness, explainability, privacy, and cost efficiency.

The core trade-off is recommendation quality versus latency and cost.


5️⃣ High-Level Architecture


Architecture

Client
→ Recommendation API
→ User Profile Service
→ Candidate Generation
→ Feature Store
→ Ranking Service
→ LLM Reasoning / Explanation Service
→ Business Rules / Safety Filters
→ Response
→ Feedback Pipeline

Core Components

User Profile Service

Stores user preferences and behavior.


Candidate Generation

Finds possible recommendation items.


Ranking Service

Scores and orders candidates.


LLM Layer

Explains, re-ranks, or personalizes recommendations.


Feedback Pipeline

Collects clicks, views, purchases, likes, skips, and conversions.


👉 Interview Answer

A production recommendation system usually includes user profiles, item catalogs, candidate generation, feature stores, ranking models, business filters, an optional LLM reasoning layer, and a feedback pipeline.
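
The stage ordering above can be sketched as a chain of plain functions. This is only an illustration under stated assumptions: `recommend`, `Stage`, `PIPELINE`, and the toy stage functions are hypothetical names, the candidate dictionary shape is invented for the example, and each function stands in for what would be a separate service in production.

```python
from typing import Callable, Dict, List

# Hypothetical candidate shape: a dict with "item_id", "score", and "in_stock".
Candidate = Dict[str, object]
Stage = Callable[[str, List[Candidate]], List[Candidate]]


def generate_candidates(user_id: str, _: List[Candidate]) -> List[Candidate]:
    # Stand-in for retrieval; a real system would query indexes here.
    return [
        {"item_id": "item_1", "score": 0.2, "in_stock": True},
        {"item_id": "item_2", "score": 0.9, "in_stock": False},
        {"item_id": "item_3", "score": 0.7, "in_stock": True},
    ]


def rank(user_id: str, candidates: List[Candidate]) -> List[Candidate]:
    # Stand-in for the ranking model: sort by a precomputed score.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


def apply_business_rules(user_id: str, candidates: List[Candidate]) -> List[Candidate]:
    # Hard filter applied after ranking, before the response is built.
    return [c for c in candidates if c["in_stock"]]


def recommend(user_id: str, stages: List[Stage], limit: int = 10) -> List[str]:
    # Each stage receives the previous stage's output.
    candidates: List[Candidate] = []
    for stage in stages:
        candidates = stage(user_id, candidates)
    return [str(c["item_id"]) for c in candidates[:limit]]


PIPELINE = [generate_candidates, rank, apply_business_rules]
```

Calling `recommend("user_123", PIPELINE)` runs the stages in order. Keeping every stage behind the same signature makes it easy to slot an optional LLM stage between ranking and the business filters.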


6️⃣ Data Model


Main Entities

User
Item
UserEvent
RecommendationRequest
CandidateItem
RankedItem
FeedbackEvent

User Profile Example

{
  "user_id": "user_123",
  "interests": ["machine learning", "system design"],
  "recent_views": ["item_1", "item_2"],
  "purchased_items": ["item_3"],
  "negative_feedback": ["item_9"]
}

Item Metadata Example

{
  "item_id": "item_456",
  "title": "Distributed Systems Course",
  "category": "education",
  "tags": ["backend", "scalability", "architecture"],
  "embedding": [0.12, -0.44, 0.88]
}

👉 Interview Answer

I would model users, items, user events, candidate items, ranked items, and feedback events.

User profiles capture preferences and behavior, while item metadata captures categories, tags, text, images, embeddings, and business attributes.
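
The two JSON examples above map naturally onto typed records. The sketch below assumes Python dataclasses; the class names and the `excluded_item_ids` helper are hypothetical, and excluding purchased items is an assumption that will not hold in domains where repurchase is common.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class UserProfile:
    user_id: str
    interests: List[str] = field(default_factory=list)
    recent_views: List[str] = field(default_factory=list)
    purchased_items: List[str] = field(default_factory=list)
    negative_feedback: List[str] = field(default_factory=list)

    def excluded_item_ids(self) -> Set[str]:
        # Assumption: already-purchased and explicitly rejected items are not re-recommended.
        return set(self.purchased_items) | set(self.negative_feedback)


@dataclass
class Item:
    item_id: str
    title: str
    category: str
    tags: List[str] = field(default_factory=list)
    embedding: List[float] = field(default_factory=list)


user = UserProfile(
    user_id="user_123",
    interests=["machine learning", "system design"],
    recent_views=["item_1", "item_2"],
    purchased_items=["item_3"],
    negative_feedback=["item_9"],
)
item = Item(
    item_id="item_456",
    title="Distributed Systems Course",
    category="education",
    tags=["backend", "scalability", "architecture"],
    embedding=[0.12, -0.44, 0.88],
)
```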


7️⃣ User Modeling


What User Profile Includes

A user profile may include:

  • Explicit preferences
  • Click history
  • Purchases
  • Searches
  • Ratings
  • Recent session behavior


Short-term vs Long-term Preference

Short-term:
What the user is doing now.

Long-term:
What the user generally likes.

👉 Interview Answer

User modeling combines long-term preferences and short-term session intent.

The long-term profile captures stable interests, while short-term behavior captures what the user wants right now.
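
One simple way to combine the two signals is a weighted blend of embeddings. The sketch below is an assumption, not a method the post prescribes: `blend_user_vector`, `mean_vector`, and the `session_weight` default of 0.7 are hypothetical, and real systems often use a learned sequence model instead.

```python
from typing import List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    # Element-wise average of equally sized vectors.
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def blend_user_vector(long_term: List[float],
                      session_items: List[List[float]],
                      session_weight: float = 0.7) -> List[float]:
    # With no session activity, fall back to the long-term profile alone.
    if not session_items:
        return list(long_term)
    short_term = mean_vector(session_items)
    # Weighted sum: session intent dominates when session_weight is high.
    return [
        session_weight * s + (1.0 - session_weight) * l
        for s, l in zip(short_term, long_term)
    ]
```

A higher `session_weight` makes the recommendations react faster to what the user is doing now, at the cost of forgetting stable interests more quickly.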


8️⃣ Item Modeling


Item Features

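Items can be represented using:

  • Structured metadata
  • Behavioral signals
  • Tags
  • Text descriptions
  • Images
  • Embeddings
  • Freshness, availability, and popularity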


LLM Role in Item Modeling

LLMs can help generate:

  • Enriched item metadata
  • Extracted attributes
  • Content summaries


👉 Interview Answer

Item modeling combines structured metadata, behavioral signals, embeddings, and generated attributes.

LLMs are useful for enriching item metadata, extracting attributes, summarizing content, and improving semantic matching.


9️⃣ Candidate Generation


Why Candidate Generation Matters

The system may have millions or billions of items.

It cannot rank everything deeply.


Candidate Sources


Flow

User Profile
→ Retrieve 1,000 candidates
→ Pass to ranking stage

👉 Interview Answer

Candidate generation retrieves a smaller set of potentially relevant items from a large catalog.

It uses collaborative filtering, content-based retrieval, vector search, popularity, trending signals, and business rules.
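
The retrieval step can be illustrated with a brute-force vector search plus a merge across sources. This is a sketch under assumptions: the function names are hypothetical, the linear scan stands in for the approximate nearest-neighbor index a large catalog would need, and round-robin merging is just one possible way to combine sources.

```python
import heapq
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_candidates(user_vec: List[float],
                      item_vecs: Dict[str, List[float]],
                      k: int) -> List[str]:
    # Brute-force scan over all items, most similar first.
    return heapq.nlargest(k, item_vecs, key=lambda i: cosine(user_vec, item_vecs[i]))


def merge_sources(sources: List[List[str]], limit: int) -> List[str]:
    # Round-robin across sources, dropping duplicates, so no single source dominates.
    merged: List[str] = []
    seen = set()
    for rank in range(max((len(s) for s in sources), default=0)):
        for source in sources:
            if rank < len(source) and source[rank] not in seen:
                seen.add(source[rank])
                merged.append(source[rank])
    return merged[:limit]
```

For example, the output of `vector_candidates` can be merged with a trending list via `merge_sources([vector_hits, trending], limit=1000)` before the set is passed to ranking.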


🔟 Ranking


Ranking Goal

Ranking orders candidates by expected user value.


Ranking Signals


Ranking Flow

Candidates
→ Feature Enrichment
→ Ranking Model
→ Ranked List
→ Filters
→ Final Recommendations

👉 Interview Answer

Ranking scores candidate items using user features, item features, behavioral signals, contextual signals, and business rules.

The ranker optimizes for predicted relevance or conversion.
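
A weighted sum of features is the simplest possible ranker and is enough to show the shape of the stage. The weights and feature names below are hypothetical and hand-set for illustration; a production ranker would learn them from logged feedback.

```python
from typing import Dict, List

# Hypothetical hand-set weights; a trained model would learn these.
WEIGHTS = {"similarity": 0.6, "popularity": 0.3, "freshness": 0.1}


def score(features: Dict[str, float]) -> float:
    # Missing features contribute zero rather than raising.
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def rank_candidates(candidates: Dict[str, Dict[str, float]], top_n: int) -> List[str]:
    # Order item ids by descending score and keep the best top_n.
    ordered = sorted(candidates, key=lambda item_id: score(candidates[item_id]), reverse=True)
    return ordered[:top_n]
```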


1️⃣1️⃣ Where LLMs Fit


LLMs Are Not Usually the First-stage Ranker

LLMs are expensive.

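They are better used for:

  • Intent understanding
  • Metadata enrichment
  • Small-set re-ranking
  • Explanations
  • Conversational recommendation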


Common Pattern

Traditional recommender retrieves top 100
→ LLM re-ranks top 10 or explains results

👉 Interview Answer

LLMs are usually not used to rank millions of items directly.

They are best used after candidate generation, for intent understanding, metadata enrichment, small-set re-ranking, explanations, and conversational recommendation.


1️⃣2️⃣ LLM-based Re-ranking


Why Re-rank with LLM?

LLMs can reason over user intent and item descriptions.


Flow

Top 20 candidates
+ User preference
+ Current query
→ LLM re-ranker
→ Final top 5

Best For


Cost Control

Only use LLMs on small candidate sets.


👉 Interview Answer

LLM re-ranking can improve quality when recommendations depend on nuanced user intent.

But because LLMs are expensive, they should only re-rank a small candidate set after cheaper retrieval and ranking stages.
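
The re-ranking flow can be sketched with the model call injected as a function, so no particular provider API is assumed. `build_rerank_prompt`, `llm_rerank`, and the `call_llm` parameter are hypothetical names, and the prompt wording is only an example. The important part is the validation: ids the model invents are dropped, and the upstream order is kept whenever the output cannot be parsed.

```python
import json
from typing import Callable, Dict, List


def build_rerank_prompt(query: str, preference: str, candidates: Dict[str, str]) -> str:
    lines = [f"- {item_id}: {desc}" for item_id, desc in candidates.items()]
    return (
        f"User preference: {preference}\n"
        f"Current query: {query}\n"
        "Candidates:\n" + "\n".join(lines) + "\n"
        "Return a JSON array of item ids, best first."
    )


def llm_rerank(query: str, preference: str, candidates: Dict[str, str],
               call_llm: Callable[[str], str], top_n: int = 5) -> List[str]:
    try:
        proposed = json.loads(call_llm(build_rerank_prompt(query, preference, candidates)))
    except (ValueError, TypeError):
        proposed = []  # Unparseable output: keep the upstream order.
    if not isinstance(proposed, list):
        proposed = []
    # Drop ids the model invented or repeated.
    ordered: List[str] = []
    for item_id in proposed:
        if isinstance(item_id, str) and item_id in candidates and item_id not in ordered:
            ordered.append(item_id)
    # Append anything the model omitted, in the original ranker order.
    ordered += [i for i in candidates if i not in ordered]
    return ordered[:top_n]
```

Because the fallback is the cheaper ranker's order, an LLM outage or malformed response degrades quality rather than availability.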


1️⃣3️⃣ Conversational Recommendation


Why Conversation Helps

Users often do not know exactly what they want.

The assistant can ask follow-up questions.


Example

User:
Recommend a laptop for AI work.

Assistant:
Do you care more about GPU performance,
battery life,
or portability?

Flow

User preference
→ Clarify constraints
→ Retrieve candidates
→ Rank
→ Explain recommendations

👉 Interview Answer

LLMs are especially useful for conversational recommendations.

They can clarify user intent, collect constraints, explain trade-offs, and turn vague preferences into structured recommendation filters.
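
The clarify-then-filter flow can be modeled as slot filling. In the sketch below the slots, questions, and catalog fields are all hypothetical; in a real system an LLM would extract slot values from free text, which is left out here so the example stays runnable.

```python
from typing import Dict, List, Optional

# Hypothetical slots for a laptop recommender and the question asked for each.
SLOT_QUESTIONS = {
    "priority": "Do you care more about GPU performance, battery life, or portability?",
    "max_price": "What is your budget?",
}


def next_question(constraints: Dict[str, object]) -> Optional[str]:
    # Ask about the first slot that has not been filled yet.
    for slot, question in SLOT_QUESTIONS.items():
        if slot not in constraints:
            return question
    return None  # All slots filled: ready to retrieve.


def apply_constraints(catalog: List[Dict[str, object]],
                      constraints: Dict[str, object]) -> List[str]:
    # Turn the collected constraints into hard filters over the catalog.
    matches = [
        item for item in catalog
        if item["price"] <= constraints["max_price"]
        and constraints["priority"] in item["strengths"]
    ]
    return [str(item["item_id"]) for item in matches]
```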


1️⃣4️⃣ Cold Start


Cold-start Problems

New User

No behavior history.


New Item

No interaction history.


LLM Help

LLMs can use:

  • Item descriptions
  • User-provided preferences
  • Onboarding answers
  • Semantic reasoning


👉 Interview Answer

LLMs can help with cold-start problems by using item descriptions, user-provided preferences, onboarding answers, and semantic reasoning before enough behavioral data exists.
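
A new-user fallback can be sketched without any behavioral data. Note the substitution: plain tag overlap (Jaccard similarity) stands in here for the semantic matching that an LLM or embedding model would provide, and the function names and the popularity backfill are assumptions made for illustration.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cold_start_recommend(onboarding_interests: List[str],
                         item_tags: Dict[str, List[str]],
                         popular_items: List[str],
                         k: int = 3) -> List[str]:
    interests = set(onboarding_interests)
    scored = [(jaccard(interests, set(tags)), item_id) for item_id, tags in item_tags.items()]
    # Keep only items with some overlap, best match first (ties broken by id).
    matched = [item_id for s, item_id in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]
    # Backfill with popular items when the stated interests match too little.
    backfill = [i for i in popular_items if i not in matched]
    return (matched + backfill)[:k]
```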


1️⃣5️⃣ Feedback Loop


What Feedback to Collect


Feedback Flow

Recommendation shown
→ User interacts
→ Event logged
→ Feature store updated
→ Models retrained
→ Recommendations improve

👉 Interview Answer

Recommendation systems depend on feedback loops.

The system should log impressions, clicks, conversions, skips, ratings, and dwell time, then use these signals to update features and retrain models.
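
The "event logged, feature store updated" step can be illustrated by aggregating raw events into a per-item click-through rate. The event tuple shape, the function name, and the smoothing priors are hypothetical; the additive smoothing is one common choice, not something the post specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_ctr(events: Iterable[Tuple[str, str]],
                  prior_clicks: float = 1.0,
                  prior_impressions: float = 20.0) -> Dict[str, float]:
    # events are (item_id, event_type) pairs; other event types are ignored here.
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for item_id, event_type in events:
        if event_type == "impression":
            impressions[item_id] += 1
        elif event_type == "click":
            clicks[item_id] += 1
    # Additive smoothing keeps items with few impressions from looking extreme.
    return {
        item_id: (clicks[item_id] + prior_clicks) / (impressions[item_id] + prior_impressions)
        for item_id in impressions
    }
```

The resulting dictionary is the kind of value that would be written to the feature store and read back by the ranker on the next request.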


1️⃣6️⃣ Safety, Fairness, and Business Rules


Important Controls

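Recommendations must respect:

  • Privacy
  • Availability
  • Age rules
  • Legal constraints
  • Content safety
  • Diversity
  • Fairness goals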


Example

Do not recommend unavailable products.

Do not recommend restricted content to underage users.

👉 Interview Answer

Recommendation systems need safety, fairness, and business-rule filters.

The final list should respect privacy, availability, age rules, legal constraints, content safety, diversity, and fairness goals.
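
The two example rules above, plus a simple diversity cap, can be sketched as a final pass over the ranked list. The field names (`available`, `min_age`, `category`) and the per-category cap are assumptions for illustration; real policies are usually richer and configured outside the code.

```python
from typing import Dict, List


def passes_rules(item: Dict[str, object], user_age: int) -> bool:
    if not item.get("available", False):
        return False  # Never recommend unavailable products.
    if user_age < int(item.get("min_age", 0)):
        return False  # Age-restricted content.
    return True


def filter_and_diversify(ranked: List[Dict[str, object]], user_age: int,
                         max_per_category: int, limit: int) -> List[str]:
    result: List[str] = []
    per_category: Dict[object, int] = {}
    for item in ranked:  # ranked is assumed to be best-first
        if not passes_rules(item, user_age):
            continue
        category = item["category"]
        if per_category.get(category, 0) >= max_per_category:
            continue  # Simple diversity cap per category.
        per_category[category] = per_category.get(category, 0) + 1
        result.append(str(item["item_id"]))
        if len(result) == limit:
            break
    return result
```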


1️⃣7️⃣ Evaluation Metrics


Offline Metrics


Online Metrics


LLM-specific Metrics


👉 Interview Answer

I would evaluate recommendation systems using both offline and online metrics.

Offline metrics include precision@K, recall@K, NDCG, diversity, and coverage.

Online metrics include CTR, conversion, retention, satisfaction, and long-term engagement.
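
Three of the offline metrics can be computed directly from a ranked list and a set of relevant items. The sketch below assumes binary relevance (an item is either relevant or not) and uses the standard log2 position discount for NDCG; the function names are hypothetical.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of the top k that is relevant.
    return sum(1 for i in ranked[:k] if i in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of all relevant items that appear in the top k.
    if not relevant:
        return 0.0
    return sum(1 for i in ranked[:k] if i in relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Binary gain; pos is 0-indexed, so the discount is log2(pos + 2).
    dcg = sum(1.0 / math.log2(pos + 2) for pos, i in enumerate(ranked[:k]) if i in relevant)
    # Ideal DCG places every relevant item at the top.
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Unlike precision and recall, NDCG rewards putting relevant items earlier in the list, which is why it is usually the headline offline metric for ranking stages.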


1️⃣8️⃣ Common Failure Modes


Failure Modes

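AI recommendation systems can fail because of:

  • Overfitting to past behavior
  • Ignoring freshness
  • Amplifying popularity bias
  • Violating constraints
  • Hallucinated explanations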


Example

LLM explains that a product matches user preference,
but the product is out of stock.

👉 Interview Answer

Recommendation systems fail when they overfit to past behavior, ignore freshness, amplify popularity bias, violate constraints, or produce hallucinated explanations.

LLM-generated explanations must be grounded in real item attributes.
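
One way to enforce grounding is to check the explanation's claims against the catalog record before showing it. The sketch assumes the LLM is asked to return its claims as structured attribute-value pairs alongside the text, which is an assumption of this example rather than something the post specifies; the function names and the fallback template are hypothetical.

```python
from typing import Dict, List


def ungrounded_claims(claimed: Dict[str, object], item: Dict[str, object]) -> List[str]:
    # Return the claimed attributes that disagree with the catalog record.
    return [name for name, value in claimed.items() if item.get(name) != value]


def safe_explanation(explanation: str, claimed: Dict[str, object],
                     item: Dict[str, object]) -> str:
    if not item.get("in_stock", False):
        # The out-of-stock case above: this item should never reach explanation.
        raise ValueError("item is out of stock and should have been filtered")
    if ungrounded_claims(claimed, item):
        # Fall back to a template built only from catalog fields.
        return f"Recommended because it is in the {item['category']} category."
    return explanation
```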


1️⃣9️⃣ Cost Control


Cost Drivers


Controls


👉 Interview Answer

Cost control is important because LLM-based recommendation can be expensive.

I would use traditional retrieval and ranking first, then apply LLMs only on small candidate sets or explanation generation.
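
Two inexpensive controls are caching and a call budget. The sketch below is a minimal illustration: `ExplanationCache`, `max_calls`, the prompt text, and the fallback sentence are all hypothetical, and the model call is injected as a function so no provider API is assumed.

```python
from typing import Callable, Dict, Tuple


class ExplanationCache:
    def __init__(self, call_llm: Callable[[str], str], max_calls: int):
        self._call_llm = call_llm
        self._max_calls = max_calls  # Hypothetical budget of LLM calls.
        self._cache: Dict[Tuple[str, str], str] = {}
        self.calls = 0

    def explain(self, item_id: str, intent: str) -> str:
        key = (item_id, intent)
        if key in self._cache:
            return self._cache[key]  # Cache hit: no LLM cost.
        if self.calls >= self._max_calls:
            # Budget spent: serve a generic template instead of calling the model.
            return "Recommended based on your recent activity."
        self.calls += 1
        text = self._call_llm(f"Explain why {item_id} fits a user interested in {intent}.")
        self._cache[key] = text
        return text
```

Keying the cache on the item and a coarse intent, rather than on the individual user, lets many users share one generated explanation.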


2️⃣0️⃣ Best Practices


Practical Rules


Design Principle

Use traditional recommenders for scale.
Use LLMs for reasoning,
conversation,
and explanation.

👉 Interview Answer

A strong LLM recommendation system combines traditional recommender infrastructure with LLM reasoning.

Candidate generation and ranking handle scale.

LLMs help with intent understanding, conversational clarification, small-set re-ranking, cold start, and explanations.


🧠 Final Staff-Level Answer


👉 Interview Answer (Full Version)

To design an AI recommendation system with LLMs, I would not use the LLM as the entire recommendation engine.

A production recommender still needs scalable candidate generation, ranking, feature stores, user profiles, item catalogs, feedback loops, and business-rule filters.

The system starts by building user profiles from explicit preferences, click history, purchases, searches, ratings, and recent session behavior.

Items are represented using structured metadata, behavioral signals, tags, text descriptions, images, embeddings, freshness, availability, and popularity.

Candidate generation retrieves a manageable set of items from a large catalog using collaborative filtering, content-based retrieval, vector search, popularity, trending signals, graph relationships, and business campaigns.

A ranking model then scores candidates using user features, item features, context, behavioral signals, and business goals.

LLMs fit best after this scalable retrieval and ranking pipeline.

They can help understand natural-language intent, enrich item metadata, solve cold-start problems, ask clarifying questions, re-rank a small candidate set, and generate recommendation explanations.

LLMs should not rank millions of items directly because that would be too slow and too expensive.

For conversational recommendations, the LLM can turn vague user preferences into structured constraints, ask follow-up questions, and explain trade-offs.

For cold-start users or new items, the LLM can use natural-language preferences and item descriptions before enough interaction data exists.

The feedback loop is critical.

The system should collect impressions, clicks, skips, purchases, dwell time, ratings, and user feedback, then update features and retrain ranking models.

Safety and fairness filters should enforce privacy, availability, age restrictions, legal constraints, diversity, and business rules.

Evaluation should include offline metrics like precision@K, recall@K, NDCG, diversity, and coverage, plus online metrics like CTR, conversion, retention, satisfaction, and long-term engagement.

The core principle is: use traditional recommenders for scale, and use LLMs for reasoning, conversation, and explanation.


⭐ Final Insight

The core of an LLM recommendation system is not:

"Throw the user's information and every item at the LLM."

The real system is:

  • User Modeling
  • Item Modeling
  • Candidate Generation
  • Ranking
  • LLM Re-ranking
  • Conversational Clarification
  • Explanation Generation
  • Feedback Loop
  • Safety Filters
  • Evaluation

Traditional recommenders handle scale.

LLMs handle reasoning, conversation, and explanation.

The most important takeaway:

Use traditional recommenders for scale.

Use LLMs for reasoning, conversation, and explanation.


