🎯 Design Notification System
1️⃣ Core Framework
When discussing Notification System design, I frame it as:
- Core flows: event trigger, notification creation, delivery
- Notification channels: email, SMS, push, in-app
- User preferences and subscription rules
- Delivery pipeline and retry strategy
- Deduplication, rate limiting, and batching
- Priority and scheduling
- Scaling, reliability, and observability
- Trade-offs: latency vs reliability vs cost
2️⃣ Core Requirements
Functional Requirements
- Send notifications to users
- Support multiple channels:
  - Push
  - Email
  - SMS
  - In-app
- Support user notification preferences
- Support templates
- Support retries
- Support scheduled notifications
- Support delivery status tracking
Non-functional Requirements
- High availability
- Reliable delivery
- Scalable fanout
- Low latency for urgent notifications
- Cost control for SMS/email
- Prevent duplicate or spammy notifications
👉 Interview Answer
A notification system receives events from other services, decides whether a user should be notified, chooses the right channel, and delivers the notification reliably.
The main challenge is balancing reliability, latency, cost, user preferences, and preventing duplicate or excessive notifications.
3️⃣ Main APIs
Send Notification
POST /api/notifications
Request:
{
  "userId": "u123",
  "type": "ORDER_SHIPPED",
  "priority": "HIGH",
  "channels": ["push", "email"],
  "payload": {
    "orderId": "o789"
  }
}
Update User Preferences
PUT /api/notification-preferences
Request:
{
  "userId": "u123",
  "preferences": {
    "ORDER_SHIPPED": ["push", "email"],
    "MARKETING": ["email"],
    "SECURITY_ALERT": ["push", "sms"]
  }
}
Get Notification History
GET /api/notifications?userId=u123&cursor=xxx&limit=50
👉 Interview Answer
I would expose APIs for creating notifications, updating user preferences, and reading notification history.
Most production notifications are triggered by internal events, so the system should also consume events from a message queue.
4️⃣ Data Model
Notification Table
notification (
  notification_id VARCHAR PRIMARY KEY,
  user_id VARCHAR,
  type VARCHAR,
  priority VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  scheduled_at TIMESTAMP,
  payload JSON
)
Notification Delivery Table
notification_delivery (
  delivery_id VARCHAR PRIMARY KEY,
  notification_id VARCHAR,
  user_id VARCHAR,
  channel VARCHAR,
  status VARCHAR,
  attempt_count INT,
  last_attempt_at TIMESTAMP,
  provider_message_id VARCHAR,
  error_code VARCHAR
)
User Preference Table
notification_preference (
  user_id VARCHAR,
  notification_type VARCHAR,
  allowed_channels ARRAY,
  enabled BOOLEAN,
  quiet_hours JSON,
  PRIMARY KEY (user_id, notification_type)
)
Template Table
notification_template (
  template_id VARCHAR PRIMARY KEY,
  notification_type VARCHAR,
  channel VARCHAR,
  language VARCHAR,
  subject TEXT,
  body TEXT
)
Device Token Table
device_token (
  user_id VARCHAR,
  device_id VARCHAR,
  platform VARCHAR,
  token TEXT,
  status VARCHAR,
  updated_at TIMESTAMP,
  PRIMARY KEY (user_id, device_id)
)
👉 Interview Answer
I would separate notification records from delivery records.
A notification represents the logical intent, while delivery records track channel-specific delivery attempts, such as push, email, or SMS.
This makes it easier to retry failed channels independently and track delivery status per channel.
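A minimal sketch of this split, using in-memory dataclasses instead of real tables. Field names follow the tables above; the `fan_out` helper and the status strings (`CREATED`, `PENDING`, `FAILED`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Notification:
    """Logical intent: 'tell user u123 that order o789 shipped'."""
    user_id: str
    type: str
    priority: str
    payload: dict
    notification_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "CREATED"


@dataclass
class Delivery:
    """One channel-specific delivery, retried and tracked on its own."""
    notification_id: str
    user_id: str
    channel: str
    delivery_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "PENDING"
    attempt_count: int = 0


def fan_out(notification: Notification, channels: list[str]) -> list[Delivery]:
    """Create one delivery record per channel (hypothetical helper)."""
    return [
        Delivery(notification.notification_id, notification.user_id, channel)
        for channel in channels
    ]


notification = Notification("u123", "ORDER_SHIPPED", "HIGH", {"orderId": "o789"})
deliveries = fan_out(notification, ["push", "email"])

# A failed email attempt does not touch the push delivery record.
deliveries[1].status = "FAILED"
deliveries[1].attempt_count += 1
```

Because each channel has its own row, the email delivery can move through retries while the push delivery stays untouched.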
5️⃣ Notification Flow
Basic Flow
Business Event
→ Notification Service
→ Preference Check
→ Template Rendering
→ Channel Selection
→ Queue
→ Channel Worker
→ External Provider
→ Delivery Status Update
Example
Order Service emits ORDER_SHIPPED
→ Notification Service receives event
→ Check user preferences
→ Render push/email templates
→ Enqueue delivery jobs
→ Push Worker / Email Worker sends notification
👉 Interview Answer
The notification pipeline starts from a business event.
The notification service validates the event, checks user preferences, renders the message using templates, chooses delivery channels, and sends delivery jobs to channel-specific queues.
Workers then deliver notifications through external providers and update delivery status.
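The service-side half of this flow can be sketched as follows. The dictionaries stand in for the preference store, template store, and channel queues, and all of their contents are made up for illustration. Rendering is done per selected channel here, on the assumption that templates are keyed by channel as in the template table.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the preference store, templates, and queues.
PREFERENCES = {("u123", "ORDER_SHIPPED"): ["push", "email"]}
TEMPLATES = {
    ("ORDER_SHIPPED", "push"): "Order {orderId} has shipped",
    ("ORDER_SHIPPED", "email"): "Good news: your order {orderId} is on its way.",
}
channel_queues = defaultdict(deque)


def handle_event(event: dict) -> int:
    """Validate, check preferences, render, and enqueue one job per channel."""
    for key in ("userId", "type", "payload"):
        if key not in event:
            raise ValueError(f"missing field: {key}")

    allowed = PREFERENCES.get((event["userId"], event["type"]), [])
    enqueued = 0
    for channel in allowed:
        template = TEMPLATES.get((event["type"], channel))
        if template is None:
            continue  # no template for this channel, so skip it
        body = template.format(**event["payload"])
        channel_queues[channel].append({"userId": event["userId"], "body": body})
        enqueued += 1
    return enqueued


count = handle_event(
    {"userId": "u123", "type": "ORDER_SHIPPED", "payload": {"orderId": "o789"}}
)
```

Channel workers would then consume `channel_queues["push"]` and `channel_queues["email"]` independently.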
6️⃣ Channel Design
Push Notification
Used for:
- Mobile alerts
- Real-time user engagement
- Security alerts
Pros:
- Fast
- Low cost
Cons:
- Requires valid device token
- Delivery is best-effort, not guaranteed
Email
Used for:
- Receipts
- Account updates
- Marketing
- Long-form content
Pros:
- Reliable enough for non-urgent messages
- Good for detailed messages
Cons:
- Slower
- Can go to spam
SMS
Used for:
- OTP
- Security alerts
- Critical messages
Pros:
- High visibility
Cons:
- Expensive
- Strict rate limits
- Provider dependency
In-app Notification
Used for:
- Notification center
- Persistent alerts
- User history
Pros:
- Durable
- User can revisit
Cons:
- Only visible when user opens app
👉 Interview Answer
Different channels have different trade-offs.
Push is low-latency and low-cost, email is better for durable and detailed communication, SMS is useful for urgent or security-critical messages, and in-app notifications provide a persistent notification history.
I would choose channels based on notification type, priority, user preferences, and cost.
7️⃣ Preference and Policy Engine
What It Checks
- User opt-in / opt-out
- Channel preference
- Quiet hours
- Notification type
- User locale
- Frequency limits
- Legal or compliance rules
Example
SECURITY_ALERT → push + sms, ignore quiet hours
MARKETING → email only, respect quiet hours
ORDER_UPDATE → push + email
👉 Interview Answer
Before sending a notification, the system should check user preferences and policy rules.
Some notifications, such as security alerts, may override quiet hours, while marketing notifications must respect opt-out and frequency limits.
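A minimal sketch of such a check, assuming a static policy table and a single quiet-hours window per user. The `POLICIES` structure and the function names are hypothetical; the source does not say whether ORDER_UPDATE respects quiet hours, so that flag is an assumption.

```python
from datetime import time

# Hypothetical policy table mirroring the example rules above.
POLICIES = {
    "SECURITY_ALERT": {"channels": ["push", "sms"], "ignore_quiet_hours": True},
    "MARKETING": {"channels": ["email"], "ignore_quiet_hours": False},
    "ORDER_UPDATE": {"channels": ["push", "email"], "ignore_quiet_hours": False},
}


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def allowed_channels(notification_type, opted_out, now, quiet_start, quiet_end):
    """Return the channels to use right now; an empty list means hold or drop."""
    policy = POLICIES.get(notification_type)
    if policy is None or notification_type in opted_out:
        return []
    if not policy["ignore_quiet_hours"] and in_quiet_hours(now, quiet_start, quiet_end):
        return []
    return policy["channels"]


quiet = (time(22, 0), time(7, 0))
late_night = time(23, 30)
security = allowed_channels("SECURITY_ALERT", set(), late_night, *quiet)
marketing = allowed_channels("MARKETING", set(), late_night, *quiet)
```

At 23:30 the security alert still goes out on push and SMS, while the marketing notification is held.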
8️⃣ Deduplication and Idempotency
Why Needed?
Business services may retry events.
Without deduplication:
- User may receive duplicate push notifications
- User may receive duplicate emails
- SMS cost can increase
- User experience becomes poor
Dedup Key
Use:
user_id + notification_type + business_entity_id
Example:
u123 + ORDER_SHIPPED + o789
Idempotency
Notification creation should accept an idempotency key:
{
  "idempotencyKey": "ORDER_SHIPPED:o789:u123"
}
👉 Interview Answer
I would make notification creation idempotent.
Since upstream services may retry events, the notification service should deduplicate based on a business key, such as user ID, notification type, and order ID.
This prevents duplicate messages and reduces unnecessary cost.
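A minimal sketch of idempotent creation, assuming an in-memory dictionary as the dedup store. The function names are hypothetical, and a real deployment would need an atomic insert-if-absent in a shared store so that concurrent retries cannot both succeed.

```python
import uuid

# Hypothetical in-memory store: idempotency key -> notification ID.
# A real system would need an atomic "insert if absent" in a shared store.
_seen: dict[str, str] = {}


def dedup_key(user_id: str, notification_type: str, entity_id: str) -> str:
    """Build the key in the same shape as the idempotencyKey example above."""
    return f"{notification_type}:{entity_id}:{user_id}"


def create_notification(user_id: str, notification_type: str, entity_id: str):
    """Return (notification_id, created); a retry returns the original ID."""
    key = dedup_key(user_id, notification_type, entity_id)
    existing = _seen.get(key)
    if existing is not None:
        return existing, False
    notification_id = uuid.uuid4().hex
    _seen[key] = notification_id
    return notification_id, True


first_id, first_created = create_notification("u123", "ORDER_SHIPPED", "o789")
retry_id, retry_created = create_notification("u123", "ORDER_SHIPPED", "o789")
```

The retried call returns the same notification ID with `created` set to `False`, so no second delivery is enqueued.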
9️⃣ Retry and Failure Handling
Common Failures
- External provider timeout
- Invalid device token
- Email bounce
- SMS provider rate limit
- Queue backlog
- Template rendering failure
Retry Strategy
Use increasing backoff intervals (roughly exponential), for example:
retry after 1 min
retry after 5 min
retry after 30 min
retry after 2 hours
Dead Letter Queue
Use DLQ for:
- Permanent failures
- Max retry exceeded
- Malformed payload
- Provider rejection
👉 Interview Answer
Delivery should be retried for transient failures, such as provider timeouts or rate limits.
I would use exponential backoff and dead-letter queues.
Permanent failures, such as invalid device tokens, should not be retried repeatedly; instead, we should mark the token invalid or record the failure reason.
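A minimal sketch of the retry decision, using the delay schedule from this section. The error codes and the `next_action` function are hypothetical; real providers define their own error taxonomy.

```python
# Schedule from the notes above, in seconds: 1 min, 5 min, 30 min, 2 hours.
RETRY_DELAYS = [60, 300, 1800, 7200]

# Hypothetical error codes; real providers define their own.
TRANSIENT_ERRORS = {"TIMEOUT", "RATE_LIMITED"}


def next_action(error_code: str, attempt_count: int):
    """Decide what to do after a failed attempt.

    `attempt_count` is the number of attempts already made (>= 1).
    Returns ("retry", delay_seconds) or ("dlq", None).
    """
    if error_code not in TRANSIENT_ERRORS:
        return ("dlq", None)  # permanent failure: do not retry
    if attempt_count > len(RETRY_DELAYS):
        return ("dlq", None)  # retries exhausted
    return ("retry", RETRY_DELAYS[attempt_count - 1])
```

For example, `next_action("TIMEOUT", 1)` schedules a retry in 60 seconds, while `next_action("INVALID_TOKEN", 1)` goes straight to the dead-letter queue. Adding random jitter to each delay is a common refinement that this sketch leaves out.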
🔟 Rate Limiting and Batching
Rate Limiting
Apply limits by:
- User
- Notification type
- Channel
- Provider
- Tenant / business service
Why?
- Prevent spam
- Control cost
- Respect provider limits
- Improve user experience
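One way to apply such limits is a sliding-window counter per key. This is a sketch under stated assumptions: the class name, the key shape `(user, type, channel)`, and the limit of two per hour are all made up, and the clock is passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # key -> timestamps of recent sends

    def allow(self, key: tuple, now: float) -> bool:
        timestamps = self._sent[key]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


# Hypothetical limit: 2 marketing emails per user per hour.
limiter = SlidingWindowLimiter(limit=2, window_seconds=3600)
key = ("u123", "MARKETING", "email")
results = [limiter.allow(key, now=t) for t in (0, 10, 20, 3601)]
```

The third send is rejected, and the fourth is allowed once the first timestamp has aged out of the window. The same class can be keyed by provider or tenant instead of by user.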
Batching
Instead of sending:
Alice liked your post
Bob liked your post
Charlie liked your post
Send:
Alice, Bob, and Charlie liked your post
👉 Interview Answer
Rate limiting prevents the system from overwhelming users and external providers.
For high-volume notifications, I would batch similar events into a single notification.
This improves user experience and reduces delivery cost.
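The "liked your post" example can be sketched as a small aggregation step. The event fields (`recipient`, `post_id`, `actor`) and both function names are assumptions for illustration.

```python
from collections import defaultdict


def join_names(names):
    """Format names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + f", and {names[-1]}"


def batch_likes(events):
    """Group like events by (recipient, post) and render one message per group."""
    groups = defaultdict(list)
    for event in events:
        groups[(event["recipient"], event["post_id"])].append(event["actor"])
    return {
        key: f"{join_names(actors)} liked your post"
        for key, actors in groups.items()
    }


events = [
    {"recipient": "u123", "post_id": "p1", "actor": "Alice"},
    {"recipient": "u123", "post_id": "p1", "actor": "Bob"},
    {"recipient": "u123", "post_id": "p1", "actor": "Charlie"},
]
messages = batch_likes(events)
```

Three events collapse into one message for `("u123", "p1")`. A production version would also need a time window that decides when a group is closed and sent.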
1️⃣1️⃣ Priority and Scheduling
Priority Levels
| Priority | Example | Behavior |
|---|---|---|
| Critical | Security alert | Send immediately |
| High | Order update | Send quickly |
| Normal | Social notification | Can be delayed |
| Low | Marketing | Batch or schedule |
Scheduling
Support:
- Send immediately
- Send at specific time
- Respect quiet hours
- Batch digest notifications
👉 Interview Answer
Not all notifications should be treated equally.
Security alerts should be delivered immediately, while marketing or digest notifications can be delayed, batched, or scheduled.
Priority queues help ensure urgent notifications are not blocked by low-priority traffic.
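A minimal in-process sketch of priority ordering, built on Python's `heapq`. The rank mapping and the `PriorityJobQueue` class are hypothetical; a production system would more likely use separate queues or topics per priority level.

```python
import heapq
import itertools

# Lower number = more urgent. Names follow the priority table above.
PRIORITY_RANK = {"CRITICAL": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3}


class PriorityJobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per level

    def push(self, priority: str, job: str) -> None:
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._counter), job))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityJobQueue()
queue.push("LOW", "weekly newsletter")
queue.push("NORMAL", "Alice liked your post")
queue.push("CRITICAL", "new login from unknown device")
queue.push("HIGH", "order o789 shipped")

order = [queue.pop() for _ in range(len(queue))]
```

The security alert is popped first even though it was enqueued third. Strict ordering like this can starve low-priority jobs under sustained load, so some form of reserved capacity or aging may be needed.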
1️⃣2️⃣ Scaling Patterns
Pattern 1: Event-driven Architecture
Business services publish events.
Order Service → Kafka → Notification Service
Pattern 2: Channel-specific Queues
Separate queues:
- Push queue
- Email queue
- SMS queue
- In-app queue
Pattern 3: Worker Pools
Scale workers independently by channel.
SMS workers scale differently from push workers
Pattern 4: Template Rendering Service
Separate rendering from delivery.
Pattern 5: Provider Abstraction
Use provider adapters:
Email Provider Adapter
SMS Provider Adapter
Push Provider Adapter
Pattern 6: Multi-provider Fallback
For critical notifications:
Primary SMS provider fails
→ fallback to secondary provider
👉 Interview Answer
To scale the system, I would use an event-driven architecture, channel-specific queues, independently scalable worker pools, and provider adapters.
This allows each notification channel to scale and fail independently.
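Provider adapters and fallback (patterns 5 and 6) can be sketched with a shared interface. Everything here is hypothetical: the `SmsProvider` protocol, the two fake providers, and the message ID format are stand-ins, not any real vendor's API.

```python
from typing import Protocol


class ProviderError(Exception):
    """Raised by an adapter when the provider call fails."""


class SmsProvider(Protocol):
    name: str

    def send(self, phone: str, body: str) -> str:
        """Send the message and return a provider message ID."""
        ...


class FailingProvider:
    name = "primary"

    def send(self, phone: str, body: str) -> str:
        raise ProviderError("primary provider timed out")


class WorkingProvider:
    name = "secondary"

    def send(self, phone: str, body: str) -> str:
        return f"{self.name}-msg-1"


def send_with_fallback(providers: list, phone: str, body: str) -> str:
    """Try each provider in order; raise only if all of them fail."""
    last_error = None
    for provider in providers:
        try:
            return provider.send(phone, body)
        except ProviderError as error:
            last_error = error  # remember the failure and try the next provider
    raise ProviderError("all providers failed") from last_error


message_id = send_with_fallback(
    [FailingProvider(), WorkingProvider()], "+15550100", "Your code is 123456"
)
```

One caveat: if the primary provider times out after actually accepting the message, falling back can deliver it twice, so fallback is best reserved for notifications where a duplicate is acceptable.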
1️⃣3️⃣ Observability
Key Metrics
- Notification created count
- Delivery success rate
- Provider error rate
- Retry count
- Queue lag
- Delivery latency
- Duplicate suppression count
- User opt-out rate
- SMS/email cost
Logs and Tracing
Track:
event_id
notification_id
delivery_id
provider_message_id
user_id
channel
status
👉 Interview Answer
Observability is critical because notification failures are often silent.
I would track delivery success rate, queue lag, retry count, provider errors, and end-to-end delivery latency.
Each notification should have traceable IDs across event ingestion, rendering, queueing, provider delivery, and status updates.
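A minimal sketch of emitting one structured log line per delivery outcome, carrying the trace fields listed above. The `metrics` counter is an in-process stand-in for a real metrics client, and the metric naming scheme and status values are assumptions.

```python
import json
from collections import Counter

metrics = Counter()  # hypothetical in-process stand-in for a metrics client


def record_delivery(event_id, notification_id, delivery_id, user_id,
                    channel, status, provider_message_id=None) -> str:
    """Count the outcome and return one structured log line."""
    metrics[f"delivery.{channel}.{status}"] += 1
    return json.dumps(
        {
            "event_id": event_id,
            "notification_id": notification_id,
            "delivery_id": delivery_id,
            "provider_message_id": provider_message_id,
            "user_id": user_id,
            "channel": channel,
            "status": status,
        },
        sort_keys=True,
    )


line = record_delivery("e1", "n1", "d1", "u123", "push", "SENT", "pm-42")
record_delivery("e1", "n1", "d2", "u123", "email", "FAILED")

attempts = metrics["delivery.push.SENT"] + metrics["delivery.push.FAILED"]
push_success_rate = metrics["delivery.push.SENT"] / attempts
```

Because both log lines share `event_id` and `notification_id`, a failed email delivery can be traced back to the same business event as the successful push.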
1️⃣4️⃣ Consistency Model
Stronger Consistency Needed For
- User preferences
- Opt-out rules
- Security notification audit logs
- Idempotency / deduplication
Eventual Consistency Acceptable For
- Delivery status
- Notification history updates
- Retry state
- Analytics counters
👉 Interview Answer
User preferences and opt-out rules need stronger correctness, because sending unwanted notifications can create compliance and user trust issues.
Delivery status and analytics can be eventually consistent, since provider callbacks may arrive later.
1️⃣5️⃣ End-to-End Flow
Normal Notification Flow
Business event
→ Notification service
→ Dedup check
→ Preference check
→ Template rendering
→ Channel selection
→ Channel queue
→ Worker sends notification
→ Provider callback updates status
Retry Flow
Delivery fails
→ classify error
→ retry with backoff if transient
→ DLQ if permanent or max retry exceeded
Digest Flow
Multiple low-priority events
→ aggregate by user and type
→ render digest
→ send scheduled notification
Key Insight
Notification systems are not just message sending systems — they are policy-driven, multi-channel delivery pipelines.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a notification system, I think of it as a policy-driven, multi-channel delivery pipeline.
Business services publish events, and the notification service decides whether a user should be notified, through which channel, and when the notification should be delivered.
The system should support push, email, SMS, and in-app notifications.
I would separate the logical notification record from channel-specific delivery records, so each channel can be retried and tracked independently.
Before sending, the system should perform deduplication, check user preferences, apply policy rules, render templates, and enqueue delivery jobs.
Delivery workers should be channel-specific and should integrate with external providers through provider adapters.
For reliability, I would use retries with exponential backoff, dead-letter queues, idempotency keys, and provider fallback for critical notifications.
To protect user experience and cost, I would add rate limiting, batching, quiet hours, and priority queues.
The main trade-offs are latency, reliability, cost, user preference correctness, and provider dependency.
Ultimately, the goal is to deliver the right message to the right user, through the right channel, at the right time.
⭐ Final Insight
A notification system is not just about sending messages — it is about reliable, policy-aware, multi-channel delivery at scale.
Implement