🎯 Design an AI Content Moderation System

1️⃣ Core Framework

When designing an AI Content Moderation System, I frame it as:

Product requirements
Content ingestion
Policy classification
Risk scoring
Automated action
Human review
Appeals and audit
Trade-offs: safety vs false positives vs latency

2️⃣ Product Goal

An AI content moderation system detects unsafe, harmful, illegal, or policy-violating content.

Examples:

Text posts
Comments
Images
Videos
Live chat
User profiles
Uploaded files
Product listings

Basic Flow

User Content
→ Moderation Classifier
→ Risk Score
→ Policy Decision
→ Allow / Block / Review
→ Audit Log

👉 Interview Answer

A content moderation system reviews user-generated content against platform policies.

It uses machine learning, LLMs, rules, risk scoring, and human review to decide whether content should be allowed, blocked, limited, or escalated.

3️⃣ Functional Requirements

Core Features

The system should support:

Text moderation
Image moderation
Video moderation
Spam detection
Abuse detection
Policy classification
Risk scoring
Human review queue
Appeal workflow
Audit logs

Moderation Actions

Allow
Block
Warn
Shadow limit
Age restrict
Send to human review
Remove content
Suspend account

👉 Interview Answer

Core requirements include classifying content, assigning risk scores, taking automated actions, escalating uncertain cases to human reviewers, supporting appeals, and keeping audit logs.

4️⃣ Non-functional Requirements

Important System Qualities

The system should optimize for:

Low latency
High precision
High recall
Scalability
Explainability
Auditability
Policy consistency
User fairness
Reviewer safety

Key Trade-off

Strict moderation
→ Safer platform
→ More false positives

Loose moderation
→ Fewer false positives
→ More harmful content

👉 Interview Answer

Non-functional requirements include low latency, high precision and recall, scalability, auditability, explainability, fairness, and policy consistency.

The main trade-off is reducing harmful content while minimizing false positives.

5️⃣ High-Level Architecture

Architecture

Client
→ Content Submission API
→ Pre-processing Service
→ Rule Engine
→ ML / LLM Moderation Service
→ Risk Scoring Service
→ Policy Decision Engine
→ Action Service
→ Human Review Queue
→ Audit / Analytics Pipeline

Core Components

Pre-processing Service

Normalizes content.

Moderation Service

Classifies content into policy categories.

Risk Scoring Service

Combines severity, confidence, and context into a risk score.

Decision Engine

Chooses allow, block, or review.

Human Review Queue

Handles uncertain or high-risk cases.

👉 Interview Answer

A moderation system usually includes content ingestion, preprocessing, rule-based checks, ML or LLM classifiers, risk scoring, a policy decision engine, automated actions, human review, appeals, audit logging, and analytics.

6️⃣ Data Model

Main Entities

User
Content
ModerationResult
PolicyCategory
ReviewTask
ReviewerDecision
Appeal
AuditLog

Content Record

{
  "content_id": "content_123",
  "user_id": "user_456",
  "content_type": "text",
  "text": "example comment",
  "created_at": "2026-05-24T10:00:00Z"
}

Moderation Result

{
  "content_id": "content_123",
  "category": "harassment",
  "severity": "medium",
  "confidence": 0.91,
  "action": "review",
  "model_version": "moderation-v3"
}

👉 Interview Answer

I would model users, content, moderation results, policy categories, review tasks, reviewer decisions, appeals, and audit logs.

Moderation results should store category, severity, confidence, action, model version, and timestamp.
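
As a minimal sketch, the two records above could be typed as Python dataclasses. The `decided_at` field is an assumption added here to cover the timestamp mentioned in the answer; it is not part of the sample JSON.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Content:
    content_id: str
    user_id: str
    content_type: str  # e.g. "text", "image", "video"
    text: str
    created_at: str    # ISO 8601 timestamp


@dataclass(frozen=True)
class ModerationResult:
    content_id: str
    category: str
    severity: str
    confidence: float
    action: str
    model_version: str
    # Assumed field: filled in when the decision is recorded.
    decided_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


result = ModerationResult(
    content_id="content_123",
    category="harassment",
    severity="medium",
    confidence=0.91,
    action="review",
    model_version="moderation-v3",
)
record = asdict(result)  # plain dict, ready to serialize as JSON
```

Freezing the dataclasses keeps a stored decision immutable, which fits the audit requirement: a changed decision becomes a new record rather than an overwrite.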

7️⃣ Content Ingestion

Supported Inputs

The system may process:

Text
Image
Video
Audio
Links
Files
User metadata
Conversation context

Ingestion Flow

User submits content
→ Store raw content
→ Create moderation job
→ Run moderation pipeline
→ Apply decision

Why Context Matters

The same phrase may be safe or unsafe depending on context.

👉 Interview Answer

Content ingestion accepts user-generated content, stores it, creates moderation jobs, and sends content through the moderation pipeline.

Context is important because moderation decisions often depend on surrounding conversation, user history, and platform policy.
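
The ingestion flow can be sketched with in-memory stand-ins. The names `raw_store`, `moderation_jobs`, and `submit_content` are hypothetical; a real system would likely use durable storage and a message queue instead.

```python
import queue
import uuid
from typing import Optional

raw_store = {}                   # stand-in for durable content storage
moderation_jobs = queue.Queue()  # stand-in for a moderation job queue


def submit_content(user_id: str, content_type: str, body: str,
                   context: Optional[dict] = None) -> str:
    """Store raw content, then enqueue a moderation job that references it."""
    content_id = f"content_{uuid.uuid4().hex[:8]}"
    raw_store[content_id] = {
        "user_id": user_id,
        "content_type": content_type,
        "body": body,
    }
    moderation_jobs.put({
        "job_id": uuid.uuid4().hex,
        "content_id": content_id,
        # Context travels with the job so classifiers can use it.
        "context": context or {},
    })
    return content_id


cid = submit_content("user_456", "text", "example comment",
                     context={"thread_id": "t1", "surface": "comments"})
job = moderation_jobs.get_nowait()
```

The job carries only the content ID plus context, so the pipeline always reads the stored original rather than a copy that may have drifted.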

8️⃣ Pre-processing

Text Pre-processing

Normalize unicode
Detect language
Remove obfuscation
Expand slang
Extract URLs
Detect repeated spam patterns

Image / Video Pre-processing

Extract frames
Run OCR
Detect objects
Detect faces when allowed
Extract audio transcript
Generate thumbnails

👉 Interview Answer

Preprocessing prepares content for classification.

For text, this includes normalization, language detection, URL extraction, and obfuscation handling.

For video, this may include frame extraction, OCR, and audio transcription.
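
A rough sketch of the text steps, using only the Python standard library. The substitution table and the zero-width character list are deliberately tiny assumptions; language detection and slang expansion are left out because they need external models or dictionaries.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://[^\s<>\"']+")
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# Hypothetical, minimal look-alike table; real tables are much larger.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "@": "a", "$": "s"})


def preprocess_text(text: str) -> dict:
    # NFKC folds compatibility forms such as fullwidth letters into plain ones.
    normalized = unicodedata.normalize("NFKC", text)
    normalized = ZERO_WIDTH.sub("", normalized)
    urls = URL_RE.findall(normalized)
    # Drop URLs before de-obfuscation so digits inside links stay untouched.
    without_urls = URL_RE.sub(" ", normalized)
    match_text = without_urls.lower().translate(LEET)
    # Collapse runs of three or more identical characters down to two.
    match_text = re.sub(r"(.)\1{2,}", r"\1\1", match_text)
    match_text = re.sub(r"\s+", " ", match_text).strip()
    return {"normalized": normalized, "match_text": match_text, "urls": urls}


out = preprocess_text("FR33 m0n3y!!! visit http://spam.example/offer1 \u200bnow")
# out["match_text"] is "free money!! visit now"
```

Keeping both `normalized` and `match_text` is a design choice: rules match against the aggressive form, while reviewers and audit logs still see text close to what the user wrote.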

9️⃣ Policy Taxonomy

Why Policy Taxonomy Matters

The system needs clear categories.

Examples:

Hate speech
Harassment
Self-harm
Sexual content
Violence
Spam
Fraud
Illegal goods
Misinformation
Privacy violation

Severity Levels

Low
Medium
High
Critical

Example

Category: Harassment
Severity: High
Action: Remove content + review account

👉 Interview Answer

A clear policy taxonomy is essential.

The system should classify content into policy categories and severity levels, then map those results to platform actions.
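
One way to express the taxonomy is a pair of enums plus a lookup table. Only the harassment/high row comes from the example above; the other rows and the fallback are hypothetical.

```python
from enum import Enum


class Category(Enum):
    HATE_SPEECH = "hate_speech"
    HARASSMENT = "harassment"
    SELF_HARM = "self_harm"
    SPAM = "spam"


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical policy table keyed by (category, severity).
POLICY_ACTIONS = {
    (Category.HARASSMENT, Severity.HIGH): ["remove_content", "review_account"],
    (Category.SPAM, Severity.LOW): ["shadow_limit"],
    (Category.SELF_HARM, Severity.CRITICAL): ["send_to_human_review"],
}
# Assumption: pairs without an explicit row fall back to a human.
DEFAULT_ACTIONS = ["send_to_human_review"]


def actions_for(category: Category, severity: Severity) -> list:
    return POLICY_ACTIONS.get((category, severity), DEFAULT_ACTIONS)
```

Keeping the mapping as data rather than code means policy owners can change it, and it can be versioned alongside the model version.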

🔟 Rule-based Moderation

Why Rules Are Useful

Rules are good for deterministic cases.

Examples:

Known spam URLs
Banned keywords
Known scam templates
Repeated posting patterns
Blocked file types
Known malicious accounts

Rule Flow

Content
→ Rule Engine
→ If strong match, take action
→ Else send to model classifier

👉 Interview Answer

Rule-based moderation is useful for deterministic, high-confidence cases like known spam URLs, blocked terms, repeated abuse patterns, and known malicious accounts.

Rules are fast, explainable, and cheap.
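
The rule flow above might look like this in a minimal form. All three lists are placeholders, and `apply_rules` is a hypothetical name; returning `None` stands for "no strong match, send to the model classifier".

```python
from urllib.parse import urlparse

# Placeholder lists for illustration only.
BLOCKED_DOMAINS = {"spam.example"}
BANNED_TERMS = {"buy followers"}
BLOCKED_EXTENSIONS = {".exe", ".scr"}


def apply_rules(text: str, urls: list, filename: str = ""):
    """Return (action, reason) on a strong match, or None to defer to the model."""
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host in BLOCKED_DOMAINS:
            return ("block", f"blocked domain: {host}")
    lowered = text.lower()
    for term in BANNED_TERMS:
        if term in lowered:
            return ("block", f"banned term: {term}")
    if any(filename.lower().endswith(ext) for ext in BLOCKED_EXTENSIONS):
        return ("block", "blocked file type")
    return None
```

Each match returns a reason string, which is what makes rule decisions easy to explain in an audit log or an appeal.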

1️⃣1️⃣ ML / LLM Moderation

Why Models Are Needed

Rules cannot understand nuance.

Models help detect:

Contextual abuse
Hate speech variants
Subtle threats
Manipulated text
Toxicity
Policy intent
Multi-language violations

Model Output

{
  "categories": ["harassment", "threat"],
  "severity": "high",
  "confidence": 0.94,
  "reason": "direct threat toward another user"
}

👉 Interview Answer

ML and LLM classifiers handle nuanced moderation cases that rules cannot capture.

They classify content into categories, estimate severity, provide confidence scores, and sometimes generate explanations for reviewer support.
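
Model output should not be trusted blindly, especially from an LLM. The sketch below validates a response shaped like the sample above and falls back to human review when it is malformed. The allowed category set and the `needs_review` flag are assumptions.

```python
import json

ALLOWED_CATEGORIES = {"harassment", "threat", "hate_speech", "spam"}  # assumed subset
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def parse_model_output(raw: str) -> dict:
    """Validate classifier JSON; anything malformed is routed to human review."""
    fallback = {"categories": [], "severity": None, "confidence": 0.0,
                "reason": "unparseable model output", "needs_review": True}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    categories = data.get("categories")
    severity = data.get("severity")
    confidence = data.get("confidence")
    valid = (
        isinstance(categories, list)
        and all(isinstance(c, str) and c in ALLOWED_CATEGORIES for c in categories)
        and isinstance(severity, str)
        and severity in ALLOWED_SEVERITIES
        and isinstance(confidence, (int, float))
        and not isinstance(confidence, bool)
        and 0.0 <= confidence <= 1.0
    )
    if not valid:
        return fallback
    return {"categories": categories, "severity": severity,
            "confidence": float(confidence),
            "reason": str(data.get("reason", "")), "needs_review": False}
```

Failing closed into review, rather than into allow, also limits what a prompt-injected response can achieve.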

1️⃣2️⃣ Risk Scoring

Why Risk Score Is Needed

Classification alone is not enough.

The system should consider:

Content severity
Model confidence
User history
Virality
Audience size
Content type
Platform context
Legal or safety risk

Example

High severity + high confidence
→ Auto block

Medium severity + low confidence
→ Human review

Low severity + low confidence
→ Allow but monitor

👉 Interview Answer

Risk scoring combines model output, severity, confidence, user history, reach, and platform context.

The final action should depend on both violation type and risk level.
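
A toy risk score can make the combination concrete. Every weight and cap below is an illustrative assumption, not a recommended value; a production system would fit or tune these against labeled data.

```python
import math

SEVERITY_WEIGHT = {"low": 0.25, "medium": 0.5, "high": 0.75, "critical": 1.0}


def risk_score(severity: str, confidence: float, prior_violations: int,
               audience_size: int) -> float:
    """Combine signals into a score in [0, 1]. Weights are illustrative only."""
    base = SEVERITY_WEIGHT[severity] * confidence
    # User history saturates at five prior violations.
    history = min(prior_violations, 5) / 5
    # Reach grows with the log of the audience and caps near one million viewers.
    reach = min(math.log10(audience_size + 1) / 6, 1.0)
    score = 0.6 * base + 0.2 * history + 0.2 * reach
    return round(min(score, 1.0), 3)
```

The log scale on reach reflects that going from 100 to 10,000 viewers matters more than going from 1,000,000 to 1,010,000.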

1️⃣3️⃣ Policy Decision Engine

Decision Engine Role

The decision engine maps moderation signals to actions.

Decision Flow

Rules result
+ Model score
+ Risk score
+ User trust level
+ Policy rules
→ Final action

Example Policy

If category = violence
and severity = critical
and confidence > 0.9
→ Remove immediately and escalate

👉 Interview Answer

The policy decision engine converts moderation signals into actions.

It combines rules, model scores, severity, confidence, user history, and platform policy to decide whether to allow, block, limit, or review content.
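
The decision flow can be sketched as an ordered list of checks. The `Signals` type, the action names, and the numeric thresholds are assumptions; only the violence/critical rule is taken from the example policy above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    category: str
    severity: str
    confidence: float
    risk: float                        # assumed output of the risk scoring step
    rule_action: Optional[str] = None  # set when a deterministic rule matched
    trusted_user: bool = False


def decide(s: Signals) -> str:
    # 1. A strong rule match wins outright.
    if s.rule_action is not None:
        return s.rule_action
    # 2. The example policy from the text.
    if s.category == "violence" and s.severity == "critical" and s.confidence > 0.9:
        return "remove_and_escalate"
    # 3. Illustrative thresholds; tune against labeled data in practice.
    if s.risk >= 0.8 and s.confidence >= 0.9:
        return "block"
    if s.risk >= 0.4:
        return "review"
    if s.trusted_user or s.risk < 0.2:
        return "allow"
    return "allow_and_monitor"
```

Because checks run top to bottom, the order itself is policy: moving the rule check below the model thresholds would change outcomes.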

1️⃣4️⃣ Human Review Queue

Why Human Review Is Needed

Models are not perfect.

Human reviewers handle:

Ambiguous cases
High-risk content
Appeals
Low-confidence model output
Policy edge cases
Sensitive topics

Queue Prioritization

Prioritize by:

Severity
Virality
Legal risk
User reports
Confidence uncertainty
VIP / public figure impact

👉 Interview Answer

Human review is needed for ambiguous, high-risk, low-confidence, or policy-sensitive cases.

The review queue should prioritize content by severity, reach, user reports, legal risk, and model uncertainty.
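
Queue prioritization maps naturally onto a heap. This is a sketch with made-up weights and hypothetical task fields (`severity` as 1 to 4, `legal_risk`, `user_reports`, `views`, `confidence`).

```python
import heapq
import itertools


class ReviewQueue:
    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    @staticmethod
    def priority(task: dict) -> float:
        # Uncertainty peaks when model confidence is near 0.5.
        uncertainty = 1 - abs(task["confidence"] - 0.5) * 2
        return (
            3.0 * task["severity"]                      # 1 (low) to 4 (critical)
            + 2.0 * (1 if task["legal_risk"] else 0)
            + 1.0 * min(task["user_reports"], 10) / 10
            + 1.0 * min(task["views"], 100_000) / 100_000
            + 1.0 * uncertainty
        )

    def push(self, task: dict) -> None:
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-self.priority(task), next(self._counter), task))

    def pop(self) -> dict:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The counter in the heap entry ensures two tasks with equal priority are never compared as dicts and come out in arrival order.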

1️⃣5️⃣ Appeals System

Why Appeals Matter

Moderation can make mistakes.

Users need a way to appeal.

Appeal Flow

User appeals decision
→ Create appeal task
→ Human reviewer evaluates
→ Decision upheld or reversed
→ Audit log updated

Why Important

Appeals improve:

User trust
Fairness
Policy quality
Model evaluation data

👉 Interview Answer

Appeals are important because moderation systems make mistakes.

An appeal workflow lets users challenge decisions, supports fairness, and provides valuable data for improving models and policies.
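
The appeal flow can be sketched as a small state change with an audit entry at each step. The in-memory `audit_log` list and the function names are hypothetical stand-ins.

```python
from datetime import datetime, timezone

audit_log = []  # stand-in for an append-only audit store


def _audit(event: str, **fields) -> None:
    audit_log.append({"event": event,
                      "at": datetime.now(timezone.utc).isoformat(), **fields})


def open_appeal(content_id: str, user_id: str, original_action: str) -> dict:
    appeal = {"content_id": content_id, "user_id": user_id,
              "original_action": original_action, "status": "pending"}
    _audit("appeal_opened", content_id=content_id, user_id=user_id)
    return appeal


def resolve_appeal(appeal: dict, reviewer_id: str, violation_confirmed: bool) -> dict:
    if appeal["status"] != "pending":
        raise ValueError("appeal already resolved")
    appeal["status"] = "upheld" if violation_confirmed else "reversed"
    # A reversal labels the original decision as a false positive for later evaluation.
    appeal["label"] = "true_positive" if violation_confirmed else "false_positive"
    _audit("appeal_resolved", content_id=appeal["content_id"],
           reviewer_id=reviewer_id, status=appeal["status"])
    return appeal
```

The `label` field is what connects appeals to model evaluation: each resolved appeal yields one human-verified example.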

1️⃣6️⃣ Feedback Loop

Feedback Sources

The system learns from:

Human reviewer decisions
User reports
Appeals
False positive analysis
False negative analysis
Model confidence drift
Policy updates

Flow

Moderation decision
→ Human / user feedback
→ Label dataset
→ Model evaluation
→ Model retraining
→ Policy improvement

👉 Interview Answer

Feedback loops are critical for moderation.

Human decisions, appeals, user reports, and false-positive analysis should be used to improve classifiers, thresholds, and policy rules.
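
One concrete use of feedback is re-tuning the auto-block threshold from reviewer labels. The sketch below assumes each sample is a pair of model confidence and the reviewer's verdict; the data is made up for illustration.

```python
def precision_recall(samples, threshold):
    """samples: list of (model_confidence, reviewer_says_violation) pairs."""
    tp = sum(1 for conf, label in samples if conf >= threshold and label)
    fp = sum(1 for conf, label in samples if conf >= threshold and not label)
    fn = sum(1 for conf, label in samples if conf < threshold and label)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pick_auto_block_threshold(samples, min_precision=0.95):
    """Lowest observed confidence whose precision meets the target, else None."""
    for threshold in sorted({conf for conf, _ in samples}):
        precision, _ = precision_recall(samples, threshold)
        if precision >= min_precision:
            return threshold
    return None


# Hypothetical reviewer-labeled sample, not real data.
reviewed = [(0.99, True), (0.95, True), (0.90, False), (0.85, True), (0.60, False)]
threshold = pick_auto_block_threshold(reviewed)
```

Here the chosen threshold is 0.95: at that cutoff both auto-blocked items are true violations, while one real violation at 0.85 falls below it and would go to review instead.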

1️⃣7️⃣ Latency Strategy

Different Paths

Synchronous Moderation

User submits content
→ Moderate before publishing

Best for high-risk surfaces.

Asynchronous Moderation

Publish content
→ Moderate in background
→ Remove if violation found

Best for lower-risk content.

Trade-off

Synchronous = safer but slower

Asynchronous = faster but riskier

👉 Interview Answer

Moderation can be synchronous or asynchronous.

Synchronous moderation is safer but adds latency.

Asynchronous moderation improves user experience but may allow harmful content to appear briefly.
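
The two paths can be sketched as a routing decision per surface. The surface names in `SYNC_SURFACES` and the one-phrase `classify` placeholder are assumptions that stand in for a real risk policy and pipeline.

```python
import queue

background_jobs = queue.Queue()               # stand-in for an async job queue
SYNC_SURFACES = {"ads", "product_listing"}    # assumed high-risk surfaces


def classify(text: str) -> str:
    """Placeholder for the real pipeline; flags one hard-coded phrase."""
    return "block" if "scam" in text.lower() else "allow"


def submit(surface: str, text: str) -> dict:
    if surface in SYNC_SURFACES:
        # Synchronous path: the decision gates publication.
        action = classify(text)
        return {"published": action == "allow", "moderation": action}
    # Asynchronous path: publish now, moderate later, remove on violation.
    background_jobs.put({"surface": surface, "text": text})
    return {"published": True, "moderation": "pending"}


def run_background_worker() -> list:
    """Drain the queue and return the jobs whose content should be removed."""
    removed = []
    while not background_jobs.empty():
        job = background_jobs.get()
        if classify(job["text"]) == "block":
            removed.append(job)
    return removed
```

The same content takes different routes depending on the surface, which is where the latency versus exposure trade-off becomes a configurable policy.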

1️⃣8️⃣ Safety and Reviewer Protection

Reviewer Safety

Human reviewers may see harmful content.

The system should provide:

Blurring for graphic content
Warning labels
Limited exposure
Mental health support
Escalation tools
Reviewer access control

Platform Safety

The system should protect against:

Evasion
Adversarial spelling
Coordinated abuse
Spam attacks
Prompt injection
Model manipulation

👉 Interview Answer

Moderation systems must also protect human reviewers.

The review UI should reduce exposure to harmful content, blur graphic material, provide warnings, and support escalation workflows.

1️⃣9️⃣ Observability

What to Monitor

Moderation latency
Auto-block rate
Human review volume
False positive rate
False negative rate
Appeal reversal rate
Policy category distribution
Model confidence drift
Reviewer throughput
Abuse spikes

Debugging Questions

Which model version made the decision?
Which policy was applied?
Why was content removed?
Was the decision appealed?
Was the model confidence low?

👉 Interview Answer

Observability is essential for moderation systems.

I would track latency, action rates, false positives, false negatives, appeal outcomes, model drift, policy categories, reviewer workload, and abuse spikes.
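
A few of these metrics can be computed directly from decision records. The record shape below (`action`, `latency_ms`, optional `appeal`) is an assumption for this sketch.

```python
from collections import Counter


def moderation_metrics(decisions: list) -> dict:
    """Each decision is a dict with 'action', 'latency_ms', and an optional
    'appeal' key holding 'upheld' or 'reversed'."""
    total = len(decisions)
    actions = Counter(d["action"] for d in decisions)
    appeals = [d["appeal"] for d in decisions if d.get("appeal")]
    latencies = sorted(d["latency_ms"] for d in decisions)
    # Nearest-rank p95 using integer math: rank = ceil(0.95 * total).
    p95 = latencies[(95 * total + 99) // 100 - 1] if total else None
    return {
        "auto_block_rate": actions["block"] / total if total else 0.0,
        "review_rate": actions["review"] / total if total else 0.0,
        "appeal_reversal_rate": (appeals.count("reversed") / len(appeals)
                                 if appeals else 0.0),
        "p95_latency_ms": p95,
    }
```

A rising appeal reversal rate is a useful proxy for false positives, since true false-positive rates need labeled ground truth that is rarely available in real time.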

2️⃣0️⃣ Best Practices

Practical Rules

Use rules for deterministic abuse
Use ML / LLMs for nuanced cases
Store model version and policy version
Use risk-based actions
Escalate uncertain cases to humans
Support appeals
Monitor false positives and negatives
Protect human reviewers
Use feedback to improve models
Keep audit logs for every action

Design Principle

Moderation is not only classification.
It is policy enforcement with feedback,
appeals,
and accountability.

👉 Interview Answer

A production moderation system should combine rules, ML classifiers, LLM reasoning, risk scoring, human review, appeals, audit logs, and feedback loops.

Moderation is policy enforcement, not just content classification.

🧠 Staff-Level Answer Final

👉 Interview Answer Full Version

To design an AI content moderation system, I would treat it as a policy enforcement platform, not just a classifier.

The system receives user-generated content such as text, images, video, audio, links, profiles, files, or product listings.

The content first goes through preprocessing: text normalization, language detection, URL extraction, OCR, video frame extraction, and audio transcription when needed.

Then the system applies a combination of rules, ML classifiers, and LLM-based moderation.

Rule-based checks are useful for deterministic cases like known spam URLs, blocked keywords, scam templates, repeated posting patterns, and known bad accounts.

ML and LLM classifiers handle nuanced cases like harassment, hate speech, threats, self-harm, sexual content, violence, fraud, and policy edge cases.

The moderation result should include policy category, severity, confidence, model version, policy version, and explanation where appropriate.

A risk scoring system combines content severity, model confidence, user history, virality, reach, content type, and legal or platform risk.

The policy decision engine then maps these signals to actions such as allow, block, remove, warn, restrict, or send to human review.

Human review is needed for ambiguous, high-risk, low-confidence, or appealed cases.

Review queues should prioritize by severity, reach, user reports, legal risk, and model uncertainty.

Appeals are important because moderation systems make mistakes.

Appeal outcomes also become valuable labeled data for improving models and policies.

The system can run synchronously for high-risk surfaces, where content must be checked before publishing, or asynchronously for lower-risk surfaces, where content is checked after publishing.

The key trade-off is safety versus latency and false positives.

Observability is critical.

We need to track moderation latency, false positives, false negatives, appeal reversal rate, model drift, policy categories, reviewer load, and abuse spikes.

Finally, the system must protect human reviewers through UI warnings, blurred graphic content, limited exposure, escalation tools, and access control.

The core principle is: moderation is not only classification.

It is policy enforcement with feedback, appeals, and accountability.

⭐ Final Insight

The core of AI content moderation is not:

“Use a model to label content safe or unsafe.”

The real system is:

Content Ingestion

Preprocessing

Policy Taxonomy

Rule Engine

ML / LLM Classification

Risk Scoring

Decision Engine

Human Review

Appeals

Audit Logs

Feedback Loop.

The single most important point:

Moderation is not only classification.

It is policy enforcement with feedback, appeals, and accountability.

Chinese Version (translated into English)

🎯 Design an AI Content Moderation System

1️⃣ Core Framework

When designing an AI Content Moderation System, I usually analyze it along these dimensions:

Product requirements
Content ingestion
Policy classification
Risk scoring
Automated action
Human review
Appeals and audit
Core trade-off: safety vs false positives vs latency

2️⃣ Product Goal

An AI content moderation system detects unsafe, harmful, illegal, or policy-violating content.

Examples:

Text posts
Comments
Images
Videos
Live chat
User profiles
Uploaded files
Product listings

Basic Flow

User Content
→ Moderation Classifier
→ Risk Score
→ Policy Decision
→ Allow / Block / Review
→ Audit Log

👉 Interview Answer

A content moderation system reviews user-generated content against platform policies.

It uses machine learning, LLMs, rules, risk scoring, and human review to decide whether content should be allowed, blocked, limited, or escalated.

3️⃣ Functional Requirements

Core Features

The system should support:

Text moderation
Image moderation
Video moderation
Spam detection
Abuse detection
Policy classification
Risk scoring
Human review queue
Appeal workflow
Audit logs

Moderation Actions

Allow
Block
Warn
Shadow limit
Age restrict
Send to human review
Remove content
Suspend account

👉 Interview Answer

Core requirements include classifying content, assigning risk scores, taking automated actions, escalating uncertain cases to human reviewers, supporting appeals, and keeping audit logs.

4️⃣ Non-functional Requirements

Important System Qualities

The system should optimize for:

Low latency
High precision
High recall
Scalability
Explainability
Auditability
Policy consistency
User fairness
Reviewer safety

Key Trade-off

Strict moderation
→ Safer platform
→ More false positives

Loose moderation
→ Fewer false positives
→ More harmful content

👉 Interview Answer

Non-functional requirements include low latency, high precision and recall, scalability, auditability, explainability, fairness, and policy consistency.

The main trade-off is reducing harmful content while minimizing false positives.

5️⃣ High-Level Architecture

Architecture

Client
→ Content Submission API
→ Pre-processing Service
→ Rule Engine
→ ML / LLM Moderation Service
→ Risk Scoring Service
→ Policy Decision Engine
→ Action Service
→ Human Review Queue
→ Audit / Analytics Pipeline

Core Components

Pre-processing Service

Normalizes content.

Moderation Service

Classifies content into policy categories.

Risk Scoring Service

Combines severity, confidence, and context into a risk score.

Decision Engine

Chooses allow, block, or review.

Human Review Queue

Handles uncertain or high-risk cases.

👉 Interview Answer

A moderation system usually includes content ingestion, preprocessing, rule-based checks, ML or LLM classifiers, risk scoring, a policy decision engine, automated actions, human review, appeals, audit logging, and analytics.

6️⃣ Data Model

Main Entities

User
Content
ModerationResult
PolicyCategory
ReviewTask
ReviewerDecision
Appeal
AuditLog

Content Record

{
  "content_id": "content_123",
  "user_id": "user_456",
  "content_type": "text",
  "text": "example comment",
  "created_at": "2026-05-24T10:00:00Z"
}

Moderation Result

{
  "content_id": "content_123",
  "category": "harassment",
  "severity": "medium",
  "confidence": 0.91,
  "action": "review",
  "model_version": "moderation-v3"
}

👉 Interview Answer

I would model users, content, moderation results, policy categories, review tasks, reviewer decisions, appeals, and audit logs.

Moderation results should store category, severity, confidence, action, model version, and timestamp.

7️⃣ Content Ingestion

Supported Inputs

The system may process:

Text
Image
Video
Audio
Links
Files
User metadata
Conversation context

Ingestion Flow

User submits content
→ Store raw content
→ Create moderation job
→ Run moderation pipeline
→ Apply decision

Why Context Matters

The same sentence may be safe or unsafe depending on context.

👉 Interview Answer

Content ingestion accepts user-generated content, stores it, creates moderation jobs, and sends content through the moderation pipeline.

Context is important because moderation decisions often depend on surrounding conversation, user history, and platform policy.

8️⃣ Pre-processing

Text Pre-processing

Normalize unicode
Detect language
Remove obfuscation
Expand slang
Extract URLs
Detect repeated spam patterns

Image / Video Pre-processing

Extract frames
Run OCR
Detect objects
Detect faces when allowed
Extract audio transcript
Generate thumbnails

👉 Interview Answer

Preprocessing prepares content for classification.

For text, this includes normalization, language detection, URL extraction, and obfuscation handling.

For video, this may include frame extraction, OCR, and audio transcription.

9️⃣ Policy Taxonomy

Why Policy Taxonomy Matters

The system needs clear categories.

Examples:

Hate speech
Harassment
Self-harm
Sexual content
Violence
Spam
Fraud
Illegal goods
Misinformation
Privacy violation

Severity Levels

Low
Medium
High
Critical

Example

Category: Harassment
Severity: High
Action: Remove content + review account

👉 Interview Answer

A clear policy taxonomy is essential.

The system should classify content into policy categories and severity levels, then map those results to platform actions.

🔟 Rule-based Moderation

Why Rules Are Useful

Rules are good for deterministic cases.

Examples:

Known spam URLs
Banned keywords
Known scam templates
Repeated posting patterns
Blocked file types
Known malicious accounts

Rule Flow

Content
→ Rule Engine
→ If strong match, take action
→ Else send to model classifier

👉 Interview Answer

Rule-based moderation is useful for deterministic, high-confidence cases like known spam URLs, blocked terms, repeated abuse patterns, and known malicious accounts.

Rules are fast, explainable, and cheap.

1️⃣1️⃣ ML / LLM Moderation

Why Models Are Needed

Rules cannot understand nuance.

Models can detect:

Contextual abuse
Hate speech variants
Subtle threats
Manipulated text
Toxicity
Policy intent
Multi-language violations

Model Output

{
  "categories": ["harassment", "threat"],
  "severity": "high",
  "confidence": 0.94,
  "reason": "direct threat toward another user"
}

👉 Interview Answer

ML and LLM classifiers handle nuanced moderation cases that rules cannot capture.

They classify content into categories, estimate severity, provide confidence scores, and sometimes generate explanations to help reviewers.

1️⃣2️⃣ Risk Scoring

Why Risk Score Is Needed

Classification alone is not enough.

The system should consider:

Content severity
Model confidence
User history
Virality
Audience size
Content type
Platform context
Legal or safety risk

Example

High severity + high confidence
→ Auto block

Medium severity + low confidence
→ Human review

Low severity + low confidence
→ Allow but monitor

👉 Interview Answer

Risk scoring combines model output, severity, confidence, user history, reach, and platform context.

The final action should depend on both violation type and risk level.

1️⃣3️⃣ Policy Decision Engine

Decision Engine Role

The decision engine maps moderation signals to actions.

Decision Flow

Rules result
+ Model score
+ Risk score
+ User trust level
+ Policy rules
→ Final action

Example Policy

If category = violence
and severity = critical
and confidence > 0.9
→ Remove immediately and escalate

👉 Interview Answer

The policy decision engine converts moderation signals into actions.

It combines rules, model scores, severity, confidence, user history, and platform policy to decide whether to allow, block, limit, or review content.

1️⃣4️⃣ Human Review Queue

Why Human Review Is Needed

Models are not perfect.

Human reviewers handle:

Ambiguous cases
High-risk content
Appeals
Low-confidence model output
Policy edge cases
Sensitive topics

Queue Prioritization

Prioritize by:

Severity
Virality
Legal risk
User reports
Confidence uncertainty
VIP / public figure impact

👉 Interview Answer

Human review is used for ambiguous, high-risk, low-confidence, or policy-sensitive cases.

The review queue should prioritize content by severity, reach, user reports, legal risk, and model uncertainty.

1️⃣5️⃣ Appeals System

Why Appeals Matter

Moderation can make mistakes.

Users need a way to appeal.

Appeal Flow

User appeals decision
→ Create appeal task
→ Human reviewer evaluates
→ Decision upheld or reversed
→ Audit log updated

Why Important

Appeals improve:

User trust
Fairness
Policy quality
Model evaluation data

👉 Interview Answer

Appeals are important because moderation systems make mistakes.

An appeal workflow lets users challenge decisions, supports fairness, and provides valuable data for improving models and policies.

1️⃣6️⃣ Feedback Loop

Feedback Sources

The system learns from:

Human reviewer decisions
User reports
Appeals
False positive analysis
False negative analysis
Model confidence drift
Policy updates

Flow

Moderation decision
→ Human / user feedback
→ Label dataset
→ Model evaluation
→ Model retraining
→ Policy improvement

👉 Interview Answer

Feedback loops are critical for moderation.

Human decisions, appeals, user reports, and false-positive analysis should be used to improve classifiers, thresholds, and policy rules.

1️⃣7️⃣ Latency Strategy

Different Paths

Synchronous Moderation

User submits content
→ Moderate before publishing

Best for high-risk surfaces.

Asynchronous Moderation

Publish content
→ Moderate in background
→ Remove if violation found

Best for lower-risk content.

Trade-off

Synchronous = safer but slower

Asynchronous = faster but riskier

👉 Interview Answer

Moderation can be synchronous or asynchronous.

Synchronous moderation is safer but adds latency.

Asynchronous moderation improves user experience but may allow harmful content to appear briefly.

1️⃣8️⃣ Safety and Reviewer Protection

Reviewer Safety

Human reviewers may see harmful content.

The system should provide:

Blurring for graphic content
Warning labels
Limited exposure
Mental health support
Escalation tools
Reviewer access control

Platform Safety

The system should protect against:

Evasion
Adversarial spelling
Coordinated abuse
Spam attacks
Prompt injection
Model manipulation

👉 Interview Answer

Moderation systems must also protect human reviewers.

The review UI should reduce exposure to harmful content, blur graphic material, provide warnings, and support escalation workflows.

1️⃣9️⃣ Observability

What to Monitor

Moderation latency
Auto-block rate
Human review volume
False positive rate
False negative rate
Appeal reversal rate
Policy category distribution
Model confidence drift
Reviewer throughput
Abuse spikes

Debugging Questions

Which model version made the decision?
Which policy was applied?
Why was content removed?
Was the decision appealed?
Was the model confidence low?

👉 Interview Answer

Observability is essential for moderation systems.

I would track latency, action rates, false positives, false negatives, appeal outcomes, model drift, policy categories, reviewer workload, and abuse spikes.

2️⃣0️⃣ Best Practices

Practical Rules

Use rules for deterministic abuse
Use ML / LLMs for nuanced cases
Store model version and policy version
Use risk-based actions
Escalate uncertain cases to humans
Support appeals
Monitor false positives and negatives
Protect human reviewers
Use feedback to improve models
Keep audit logs for every action

Design Principle

Moderation is not only classification.
It is policy enforcement with feedback,
appeals,
and accountability.

👉 Interview Answer

A production moderation system should combine rules, ML classifiers, LLM reasoning, risk scoring, human review, appeals, audit logs, and feedback loops.

Moderation is policy enforcement, not just content classification.

🧠 Staff-Level Answer Final

👉 Interview Answer Full Version

To design an AI content moderation system, I would treat it as a policy enforcement platform, not just a classifier.

The system receives user-generated content such as text, images, video, audio, links, profiles, files, or product listings.

The content first goes through preprocessing: text normalization, language detection, URL extraction, OCR, video frame extraction, and audio transcription when needed.

Then the system combines rules, ML classifiers, and LLM-based moderation.

Rule-based checks are useful for deterministic cases like known spam URLs, blocked keywords, scam templates, repeated posting patterns, and known bad accounts.

ML and LLM classifiers handle nuanced cases like harassment, hate speech, threats, self-harm, sexual content, violence, fraud, and policy edge cases.

The moderation result should include policy category, severity, confidence, model version, policy version, and an explanation where needed.

A risk scoring system combines content severity, model confidence, user history, virality, reach, content type, and legal or platform risk.

The policy decision engine then maps these signals to actions such as allow, block, remove, warn, restrict, or send to human review.

Human review is used for ambiguous, high-risk, low-confidence, or appealed cases.

Review queues should prioritize by severity, reach, user reports, legal risk, and model uncertainty.

Appeals are important because moderation systems make mistakes.

Appeal outcomes also become valuable labeled data for improving models and policies.

The system can use synchronous moderation on high-risk surfaces, where content is reviewed before publishing, and asynchronous moderation on lower-risk surfaces, where content is reviewed in the background after publishing.

The key trade-off is safety versus latency and false positives.

Observability is critical.

We need to track moderation latency, false positives, false negatives, appeal reversal rate, model drift, policy categories, reviewer load, and abuse spikes.

Finally, the system must also protect human reviewers through UI warnings, blurred graphic content, limited exposure, escalation tools, and access control.

The core principle is: moderation is not only classification.

It is policy enforcement with feedback, appeals, and accountability.

⭐ Final Insight

The core of AI content moderation is not:

“Use a model to label content safe or unsafe.”

The real system is:

Content Ingestion

Preprocessing

Policy Taxonomy

Rule Engine

ML / LLM Classification

Risk Scoring

Decision Engine

Human Review

Appeals

Audit Logs

Feedback Loop.

The single most important point:

Moderation is not only classification.

It is policy enforcement with feedback, appeals, and accountability.