System Design Deep Dive - 05 Design Notification System

Post by ailswan, April 29, 2026


🎯 Design Notification System


1️⃣ Core Framework

When discussing Notification System design, I frame it as:

  1. Core flows: event trigger, notification creation, delivery
  2. Notification channels: email, SMS, push, in-app
  3. User preferences and subscription rules
  4. Delivery pipeline and retry strategy
  5. Deduplication, rate limiting, and batching
  6. Priority and scheduling
  7. Scaling, reliability, and observability
  8. Trade-offs: latency vs reliability vs cost

2️⃣ Core Requirements


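Functional Requirements

Receive notification events from other services
Decide whether a user should be notified, based on preferences and subscription rules
Deliver through email, SMS, push, and in-app channels
Deduplicate, rate-limit, and batch notifications
Support priority levels and scheduled delivery

Non-functional Requirements

Reliable delivery with retries
Low latency for urgent notifications
Controlled delivery cost
Scalability and observability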


👉 Interview Answer

A notification system receives events from other services, decides whether a user should be notified, chooses the right channel, and delivers the notification reliably.

The main challenge is balancing reliability, latency, and cost while respecting user preferences and preventing duplicate or excessive notifications.


3️⃣ Main APIs


Send Notification

POST /api/notifications

Request:

{
  "userId": "u123",
  "type": "ORDER_SHIPPED",
  "priority": "HIGH",
  "channels": ["push", "email"],
  "payload": {
    "orderId": "o789"
  }
}

Update User Preferences

PUT /api/notification-preferences

Request:

{
  "userId": "u123",
  "preferences": {
    "ORDER_SHIPPED": ["push", "email"],
    "MARKETING": ["email"],
    "SECURITY_ALERT": ["push", "sms"]
  }
}

Get Notification History

GET /api/notifications?userId=u123&cursor=xxx&limit=50

👉 Interview Answer

I would expose APIs for creating notifications, updating user preferences, and reading notification history.

Most production notifications are triggered by internal events, so the system should also consume events from a message queue.
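
As a rough illustration of the create-notification request above, here is a minimal Python sketch that validates a request body before it enters the pipeline. The `SendRequest` and `validate_send_request` names, the `in_app` spelling, and the exact priority strings are assumptions for illustration; the post only shows `HIGH` as a wire value.

```python
from dataclasses import dataclass, field

# Assumed value sets: the post names these channels and priority levels,
# but the exact wire-format spellings are a guess.
ALLOWED_CHANNELS = {"push", "email", "sms", "in_app"}
ALLOWED_PRIORITIES = {"CRITICAL", "HIGH", "NORMAL", "LOW"}


@dataclass
class SendRequest:
    user_id: str
    type: str
    priority: str
    channels: list
    payload: dict = field(default_factory=dict)


def validate_send_request(body: dict) -> SendRequest:
    """Parse a POST /api/notifications body, raising ValueError if malformed."""
    for key in ("userId", "type", "priority", "channels"):
        if key not in body:
            raise ValueError(f"missing field: {key}")
    if body["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {body['priority']}")
    unknown = set(body["channels"]) - ALLOWED_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")
    return SendRequest(
        user_id=body["userId"],
        type=body["type"],
        priority=body["priority"],
        channels=list(body["channels"]),
        payload=dict(body.get("payload", {})),
    )
```

The same validation could sit in front of the queue consumer, so API calls and internal events are checked the same way.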


4️⃣ Data Model


Notification Table

notification (
  notification_id VARCHAR PRIMARY KEY,
  user_id VARCHAR,
  type VARCHAR,
  priority VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  scheduled_at TIMESTAMP,
  payload JSON
)

Notification Delivery Table

notification_delivery (
  delivery_id VARCHAR PRIMARY KEY,
  notification_id VARCHAR,
  user_id VARCHAR,
  channel VARCHAR,
  status VARCHAR,
  attempt_count INT,
  last_attempt_at TIMESTAMP,
  provider_message_id VARCHAR,
  error_code VARCHAR
)

User Preference Table

notification_preference (
  user_id VARCHAR,
  notification_type VARCHAR,
  allowed_channels ARRAY,
  enabled BOOLEAN,
  quiet_hours JSON,
  PRIMARY KEY (user_id, notification_type)
)

Template Table

notification_template (
  template_id VARCHAR PRIMARY KEY,
  notification_type VARCHAR,
  channel VARCHAR,
  language VARCHAR,
  subject TEXT,
  body TEXT
)

Device Token Table

device_token (
  user_id VARCHAR,
  device_id VARCHAR,
  platform VARCHAR,
  token TEXT,
  status VARCHAR,
  updated_at TIMESTAMP,
  PRIMARY KEY (user_id, device_id)
)

👉 Interview Answer

I would separate notification records from delivery records.

A notification represents the logical intent, while delivery records track channel-specific delivery attempts, such as push, email, or SMS.

This makes it easier to retry failed channels independently and track delivery status per channel.
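
To make the split between notification and delivery records concrete, here is a small in-memory sketch. The status strings (`CREATED`, `PENDING`, `SENT`, `FAILED`) and the `fan_out` and `record_attempt` helpers are assumptions, not part of the schema above.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Notification:
    notification_id: str
    user_id: str
    type: str
    status: str = "CREATED"


@dataclass
class Delivery:
    delivery_id: str
    notification_id: str
    channel: str
    status: str = "PENDING"
    attempt_count: int = 0
    error_code: Optional[str] = None


def fan_out(notification: Notification, channels: list) -> list:
    """Create one delivery record per channel for a logical notification."""
    return [
        Delivery(delivery_id=str(uuid.uuid4()),
                 notification_id=notification.notification_id,
                 channel=channel)
        for channel in channels
    ]


def record_attempt(delivery: Delivery, ok: bool, error_code: Optional[str] = None) -> None:
    """Update a single channel's delivery without touching the others."""
    delivery.attempt_count += 1
    delivery.status = "SENT" if ok else "FAILED"
    delivery.error_code = None if ok else error_code
```

Because each channel has its own row, a failed email can be retried while the push delivery for the same notification is already marked as sent.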


5️⃣ Notification Flow


Basic Flow

Business Event
→ Notification Service
→ Preference Check
→ Template Rendering
→ Channel Selection
→ Queue
→ Channel Worker
→ External Provider
→ Delivery Status Update

Example

Order Service emits ORDER_SHIPPED
→ Notification Service receives event
→ Check user preferences
→ Render push/email templates
→ Enqueue delivery jobs
→ Push Worker / Email Worker sends notification

👉 Interview Answer

The notification pipeline starts from a business event.

The notification service validates the event, checks user preferences, renders the message using templates, chooses delivery channels, and sends delivery jobs to channel-specific queues.

Workers then deliver notifications through external providers and update delivery status.
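
The flow above can be sketched in a few lines of Python. This is only an in-memory illustration: `PREFERENCES`, `TEMPLATES`, `QUEUES`, and `handle_event` are hypothetical stand-ins for the preference store, template table, and channel queues, and the template texts are made up.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the preference store, templates, and queues.
PREFERENCES = {("u123", "ORDER_SHIPPED"): ["push", "email"]}
TEMPLATES = {
    ("ORDER_SHIPPED", "push"): "Order {orderId} has shipped",
    ("ORDER_SHIPPED", "email"): "Good news: your order {orderId} is on its way.",
}
QUEUES = defaultdict(deque)  # one queue per channel


def handle_event(event: dict) -> int:
    """Run one event through preference check, rendering, and enqueueing.

    Returns the number of delivery jobs enqueued.
    """
    allowed = PREFERENCES.get((event["userId"], event["type"]), [])
    enqueued = 0
    for channel in allowed:
        template = TEMPLATES.get((event["type"], channel))
        if template is None:
            continue  # no template for this channel, so skip it
        body = template.format(**event["payload"])
        QUEUES[channel].append({"userId": event["userId"], "body": body})
        enqueued += 1
    return enqueued
```

A channel worker would then pop jobs from its own queue, call the external provider, and write the delivery status back.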


6️⃣ Channel Design


Push Notification

Used for:

Pros:

Cons:


Email

Used for:

Pros:

Cons:


SMS

Used for:

Pros:

Cons:


In-app Notification

Used for:

Pros:

Cons:


👉 Interview Answer

Different channels have different trade-offs.

Push is low-latency and low-cost, email is better for durable and detailed communication, SMS is useful for urgent or security-critical messages, and in-app notifications provide a persistent notification history.

I would choose channels based on notification type, priority, user preferences, and cost.
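
One possible way to combine these factors is sketched below. The relative cost numbers, the `max_cost` threshold, and the rule that only critical notifications may use expensive channels are all assumptions chosen for illustration.

```python
# Assumed relative per-message costs, used only to order channels cheapest first.
CHANNEL_COST = {"in_app": 0, "push": 1, "email": 2, "sms": 10}


def select_channels(requested, user_allowed, priority, max_cost=2):
    """Pick channels that are both requested and allowed by the user,
    dropping costly ones unless the notification is critical."""
    candidates = [c for c in requested if c in user_allowed]
    if priority != "CRITICAL":
        candidates = [c for c in candidates if CHANNEL_COST[c] <= max_cost]
    # Cheapest channel first, so workers can try low-cost delivery before SMS.
    return sorted(candidates, key=CHANNEL_COST.__getitem__)
```

For example, a normal-priority notification requested on SMS and push for a user who allows both would go out on push only under these assumed costs.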


7️⃣ Preference and Policy Engine


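What It Checks

Whether the user has enabled this notification type
Which channels the user allows for it
Quiet hours
Opt-out status and frequency limits
Policy overrides, such as security alerts ignoring quiet hours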


Example

SECURITY_ALERT → push + sms, ignore quiet hours
MARKETING → email only, respect quiet hours
ORDER_UPDATE → push + email

👉 Interview Answer

Before sending a notification, the system should check user preferences and policy rules.

Some notifications, such as security alerts, may override quiet hours, while marketing notifications must respect opt-out and frequency limits.
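
A minimal sketch of such a policy check follows. The `POLICY` table mirrors the three example rules above; the `decide` and `in_quiet_hours` helpers and the idea of returning an empty list to suppress or defer are assumptions.

```python
from datetime import time

# Assumed policy table mirroring the three example rules above.
POLICY = {
    "SECURITY_ALERT": {"channels": ["push", "sms"], "ignore_quiet_hours": True},
    "MARKETING": {"channels": ["email"], "ignore_quiet_hours": False},
    "ORDER_UPDATE": {"channels": ["push", "email"], "ignore_quiet_hours": False},
}


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def decide(notification_type, now, quiet_start, quiet_end, opted_out=frozenset()):
    """Return the channels to send on right now, or [] to suppress or defer."""
    rule = POLICY.get(notification_type)
    if rule is None or notification_type in opted_out:
        return []
    if not rule["ignore_quiet_hours"] and in_quiet_hours(now, quiet_start, quiet_end):
        return []
    return list(rule["channels"])
```

With quiet hours from 22:00 to 07:00, a security alert at 23:30 still goes out on push and SMS, while a marketing email at the same time is held back.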


8️⃣ Deduplication and Idempotency


Why Needed?

Business services may retry events.

Without deduplication:

Users receive duplicate messages
Delivery cost increases unnecessarily


Dedup Key

Use:

user_id + notification_type + business_entity_id

Example:

u123 + ORDER_SHIPPED + o789

Idempotency

Notification creation should accept an idempotency key:

{
  "idempotencyKey": "ORDER_SHIPPED:o789:u123"
}

👉 Interview Answer

I would make notification creation idempotent.

Since upstream services may retry events, the notification service should deduplicate based on a business key, such as user ID, notification type, and order ID.

This prevents duplicate messages and reduces unnecessary cost.
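
Here is a minimal sketch of idempotent creation, using an in-memory dictionary as a stand-in for a table with a unique index on the dedup key. The `NotificationStore` class is hypothetical; the key layout follows the `ORDER_SHIPPED:o789:u123` example above.

```python
import uuid


class NotificationStore:
    """In-memory stand-in for a table with a unique index on the dedup key."""

    def __init__(self):
        self._by_key = {}

    def create(self, user_id: str, notification_type: str, entity_id: str):
        """Return (notification_id, created); repeats return the original ID."""
        key = f"{notification_type}:{entity_id}:{user_id}"
        existing = self._by_key.get(key)
        if existing is not None:
            return existing, False
        notification_id = str(uuid.uuid4())
        self._by_key[key] = notification_id
        return notification_id, True
```

In a real deployment the check-and-insert would need to be atomic, for example through a database unique constraint, since two retries can arrive at the same time on different instances.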


9️⃣ Retry and Failure Handling


Common Failures

Provider timeouts (transient)
Provider rate limits (transient)
Invalid device tokens (permanent)


Retry Strategy

Use exponential-style backoff with increasing delays, for example:

retry after 1 min
retry after 5 min
retry after 30 min
retry after 2 hours

Dead Letter Queue

Use DLQ for:

Permanent failures
Deliveries that exceed the maximum retry count


👉 Interview Answer

Delivery should be retried for transient failures, such as provider timeouts or rate limits.

I would use exponential backoff and dead-letter queues.

Permanent failures, such as invalid device tokens, should not be retried repeatedly; instead, we should mark the token invalid or record the failure reason.
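
The retry decision can be sketched as a small function. The delay schedule is the one listed above; the error code names and the `next_action` helper are hypothetical, since real providers define their own error codes.

```python
# Delay schedule from the post, in seconds: 1 min, 5 min, 30 min, 2 hours.
RETRY_DELAYS = [60, 300, 1800, 7200]

# Assumed error codes; real providers define their own.
TRANSIENT_ERRORS = {"PROVIDER_TIMEOUT", "RATE_LIMITED"}
PERMANENT_ERRORS = {"INVALID_DEVICE_TOKEN"}


def next_action(error_code: str, attempt_count: int):
    """Decide what to do after a failed attempt.

    attempt_count is the number of attempts made so far (1 after the first
    failure). Returns ("retry", delay_seconds) or ("dlq", None).
    """
    if error_code in PERMANENT_ERRORS:
        return ("dlq", None)
    if error_code in TRANSIENT_ERRORS and 1 <= attempt_count <= len(RETRY_DELAYS):
        return ("retry", RETRY_DELAYS[attempt_count - 1])
    return ("dlq", None)  # unknown error or retries exhausted
```

Production systems often add random jitter to each delay so that many failed deliveries do not all retry at the same moment.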


🔟 Rate Limiting and Batching


Rate Limiting

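Apply limits by:

User
External provider

Why?

Avoid overwhelming users with too many notifications
Avoid overwhelming external providers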


Batching

Instead of sending:

Alice liked your post
Bob liked your post
Charlie liked your post

Send:

Alice, Bob, and Charlie liked your post
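
A small sketch of how the merged text could be produced. The `batch_message` helper is hypothetical, and collapsing more than three names into "N others" is an assumed product choice, not something the example above specifies.

```python
def batch_message(actors, action="liked your post"):
    """Collapse several similar events into one notification line."""
    if not actors:
        return None
    if len(actors) == 1:
        return f"{actors[0]} {action}"
    if len(actors) == 2:
        return f"{actors[0]} and {actors[1]} {action}"
    if len(actors) == 3:
        return f"{actors[0]}, {actors[1]}, and {actors[2]} {action}"
    # Summarizing beyond three names is an assumed product choice.
    return f"{actors[0]}, {actors[1]}, and {len(actors) - 2} others {action}"
```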

👉 Interview Answer

Rate limiting prevents the system from overwhelming users and external providers.

For high-volume notifications, I would batch similar events into a single notification.

This improves user experience and reduces delivery cost.
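
For the rate-limiting side, here is a minimal sliding-window limiter keyed by user or provider. The class name and the choice of a sliding window (rather than, say, a token bucket) are assumptions; the caller passes the current time explicitly to keep the sketch easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` notifications per key within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._sent = defaultdict(deque)  # key -> timestamps of allowed sends

    def allow(self, key: str, now: float) -> bool:
        stamps = self._sent[key]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True
```

A rejected notification does not have to be dropped; it could be deferred or folded into a later batch.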


1️⃣1️⃣ Priority and Scheduling


Priority Levels

| Priority | Example             | Behavior          |
|----------|---------------------|-------------------|
| Critical | Security alert      | Send immediately  |
| High     | Order update        | Send quickly      |
| Normal   | Social notification | Can be delayed    |
| Low      | Marketing           | Batch or schedule |

Scheduling

Support:

Delayed delivery
Scheduled delivery at a set time (scheduled_at)
Digest notifications


👉 Interview Answer

Not all notifications should be treated equally.

Security alerts should be delivered immediately, while marketing or digest notifications can be delayed, batched, or scheduled.

Priority queues help ensure urgent notifications are not blocked by low-priority traffic.
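
A priority queue along these lines can be sketched with Python's `heapq`. The `PriorityDispatchQueue` class and the numeric ranks are assumptions; the four levels follow the priority table in this section.

```python
import heapq
import itertools

# Lower number means more urgent; the levels follow the priority table.
PRIORITY_RANK = {"CRITICAL": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3}


class PriorityDispatchQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per level

    def push(self, priority: str, job: str) -> None:
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._counter), job))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

A single heap like this can starve low-priority jobs under sustained urgent traffic, so separate queues per priority level with dedicated workers are a common alternative.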


1️⃣2️⃣ Scaling Patterns


Pattern 1: Event-driven Architecture

Business services publish events.

Order Service → Kafka → Notification Service

Pattern 2: Channel-specific Queues

Separate queues:

Push queue
Email queue
SMS queue
In-app queue


Pattern 3: Worker Pools

Scale workers independently by channel.

SMS workers scale differently from push workers

Pattern 4: Template Rendering Service

Separate rendering from delivery.


Pattern 5: Provider Abstraction

Use provider adapters:

Email Provider Adapter
SMS Provider Adapter
Push Provider Adapter

Pattern 6: Multi-provider Fallback

For critical notifications:

Primary SMS provider fails
→ fallback to secondary provider

👉 Interview Answer

To scale the system, I would use an event-driven architecture, channel-specific queues, independently scalable worker pools, and provider adapters.

This allows each notification channel to scale and fail independently.
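
Provider adapters and multi-provider fallback can be sketched as follows. `FakeSmsProvider`, `ProviderError`, and `send_with_fallback` are hypothetical names; a real adapter would wrap a vendor SDK or HTTP API behind the same `send` method.

```python
class ProviderError(Exception):
    """Raised by an adapter when the provider call fails."""


class FakeSmsProvider:
    """Hypothetical adapter; a real one would wrap a vendor SDK or HTTP API."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def send(self, to: str, body: str) -> str:
        if not self.healthy:
            raise ProviderError(f"{self.name} unavailable")
        return f"{self.name}:{to}"  # stands in for a provider message ID


def send_with_fallback(providers, to: str, body: str) -> str:
    """Try each provider in order and return the first provider message ID."""
    errors = []
    for provider in providers:
        try:
            return provider.send(to, body)
        except ProviderError as exc:
            errors.append(str(exc))
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Because workers only depend on the `send` method, swapping or adding a provider does not change the delivery code.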


1️⃣3️⃣ Observability


Key Metrics

Delivery success rate
Queue lag
Retry count
Provider errors
End-to-end delivery latency


Logs and Tracing

Track:

event_id
notification_id
delivery_id
provider_message_id
user_id
channel
status

👉 Interview Answer

Observability is critical because notification failures are often silent.

I would track delivery success rate, queue lag, retry count, provider errors, and end-to-end delivery latency.

Each notification should have traceable IDs across event ingestion, rendering, queueing, provider delivery, and status updates.
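
As a sketch of what this could look like in code, the helper below emits one structured log line carrying the traceable IDs listed above and updates a per-channel counter. `METRICS` is a plain `Counter` standing in for a real metrics client, and the `SENT` and `FAILED` status values are assumptions.

```python
import json
from collections import Counter

METRICS = Counter()  # stand-in for a real metrics client


def log_delivery(event_id, notification_id, delivery_id, user_id, channel,
                 status, provider_message_id=None) -> str:
    """Build one structured log line and bump a per-channel status counter."""
    METRICS[(channel, status)] += 1
    return json.dumps({
        "event_id": event_id,
        "notification_id": notification_id,
        "delivery_id": delivery_id,
        "provider_message_id": provider_message_id,
        "user_id": user_id,
        "channel": channel,
        "status": status,
    }, sort_keys=True)


def success_rate(channel: str) -> float:
    """Share of SENT among finished deliveries for a channel (0.0 if none)."""
    sent = METRICS[(channel, "SENT")]
    failed = METRICS[(channel, "FAILED")]
    total = sent + failed
    return sent / total if total else 0.0
```

Searching logs by `notification_id` then shows every channel attempt for one notification, which helps when a failure would otherwise go unnoticed.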


1️⃣4️⃣ Consistency Model


Stronger Consistency Needed For

User preferences
Opt-out rules

Eventual Consistency Acceptable For

Delivery status
Analytics


👉 Interview Answer

User preferences and opt-out rules need stronger correctness, because sending unwanted notifications can create compliance and user trust issues.

Delivery status and analytics can be eventually consistent, since provider callbacks may arrive later.


1️⃣5️⃣ End-to-End Flow


Normal Notification Flow

Business event
→ Notification service
→ Dedup check
→ Preference check
→ Template rendering
→ Channel selection
→ Channel queue
→ Worker sends notification
→ Provider callback updates status

Retry Flow

Delivery fails
→ classify error
→ retry with backoff if transient
→ DLQ if permanent or max retry exceeded

Digest Flow

Multiple low-priority events
→ aggregate by user and type
→ render digest
→ send scheduled notification

Key Insight

Notification systems are not just message sending systems — they are policy-driven, multi-channel delivery pipelines.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a notification system, I think of it as a policy-driven, multi-channel delivery pipeline.

Business services publish events, and the notification service decides whether a user should be notified, through which channel, and when the notification should be delivered.

The system should support push, email, SMS, and in-app notifications.

I would separate the logical notification record from channel-specific delivery records, so each channel can be retried and tracked independently.

Before sending, the system should perform deduplication, check user preferences, apply policy rules, render templates, and enqueue delivery jobs.

Delivery workers should be channel-specific and should integrate with external providers through provider adapters.

For reliability, I would use retries with exponential backoff, dead-letter queues, idempotency keys, and provider fallback for critical notifications.

To protect user experience and cost, I would add rate limiting, batching, quiet hours, and priority queues.

The main trade-offs are latency, reliability, cost, user preference correctness, and provider dependency.

Ultimately, the goal is to deliver the right message to the right user, through the right channel, at the right time.


⭐ Final Insight

A notification system is not just about sending messages — it is about reliable, policy-aware, multi-channel delivery at scale.



