
🎯 Multi-region System Design: Active-active vs Active-passive

1️⃣ Core Framework

When discussing multi-region system design, I frame it as:

Why multi-region is needed
Active-active architecture
Active-passive architecture
Traffic routing
Data replication
Failover strategy
Consistency trade-offs
Trade-offs: availability vs complexity vs cost

2️⃣ Why Multi-region Systems Exist

A multi-region system runs across multiple geographic regions.

The goal is to improve:

Availability
Disaster recovery
User latency
Regional fault tolerance
Compliance
Business continuity

Basic Idea

Users
→ Global Traffic Router
→ Region A / Region B / Region C
→ Regional Services
→ Regional Data Stores

👉 Interview Answer

Multi-region systems are designed to survive regional failures and serve users closer to their location.

The main goals are higher availability, disaster recovery, lower latency, and business continuity.

The key design choice is whether regions are active-active or active-passive.

3️⃣ What Is Active-passive?

Definition

In active-passive architecture, one primary region serves live traffic.

Another region stays on standby.

Normal state:
Users → Active Region

Failure state:
Users → Passive Region after failover

Example

Region A: active
Region B: standby

Best For

Active-passive is good when:

Simplicity matters
Strong consistency matters
Write conflicts must be avoided
Failover can tolerate some delay
Cost should be controlled

👉 Interview Answer

Active-passive means one region handles production traffic, while another region is kept as a standby.

If the active region fails, traffic is failed over to the passive region.

It is simpler than active-active, but recovery may take longer.

4️⃣ What Is Active-active?

Definition

In active-active architecture, multiple regions serve live traffic at the same time.

Users in US → US Region
Users in Europe → Europe Region
Users in Asia → Asia Region

Example

Region A: active
Region B: active
Region C: active

Best For

Active-active is good when:

Very high availability is required
Low global latency matters
Traffic must be distributed globally
Regional failures should be handled quickly
The system can tolerate replication complexity

👉 Interview Answer

Active-active means multiple regions serve production traffic simultaneously.

It can provide lower latency and higher availability, but it is much harder to design because data replication, consistency, and conflict resolution become more complex.

5️⃣ Core Difference

Active-passive

One region serves traffic.
Another region waits for failover.

Active-active

Multiple regions serve traffic at the same time.

Comparison Table

Dimension	Active-passive	Active-active
Availability	High	Very high
Latency	Depends on active region	Lower globally
Complexity	Lower	Higher
Cost	Lower	Higher
Failover	Slower	Faster
Data conflicts	Easier to avoid	Harder to avoid
Operations	Simpler	More complex
Best for	DR and simplicity	Global scale and low latency

👉 Interview Answer

The main difference is whether multiple regions serve traffic simultaneously.

Active-passive is simpler and avoids many conflict problems.

Active-active gives better availability and latency, but increases complexity in routing, replication, consistency, and operations.

6️⃣ Traffic Routing

Global Routing Options

Multi-region systems usually use:

DNS routing
Global load balancer
Anycast routing
Geo-based routing
Latency-based routing
Weighted routing
Health-check-based failover

Active-passive Routing

All traffic → Primary region

If primary unhealthy:
Traffic → Standby region

Active-active Routing

User traffic
→ Nearest healthy region
→ Regional service

👉 Interview Answer

Traffic routing is a core part of multi-region design.

Active-passive usually routes all traffic to the primary region until failover.

Active-active routes users to the nearest or best healthy region using geo, latency, or weighted routing.
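
Example

A minimal sketch of the two routing policies above, assuming a hypothetical in-memory view of region health and per-user latency measurements. The region names and the functions `route_active_passive` and `route_active_active` are illustrative only, not any real load balancer's API.

```python
from typing import Dict, Optional


def route_active_passive(primary: str, standby: str, healthy: Dict[str, bool]) -> Optional[str]:
    """Send all traffic to the primary; use the standby only if the primary is unhealthy."""
    if healthy.get(primary, False):
        return primary
    if healthy.get(standby, False):
        return standby
    return None  # no healthy region left


def route_active_active(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Send the user to the lowest-latency region that is currently healthy."""
    candidates = {r: ms for r, ms in latency_ms.items() if healthy.get(r, False)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical state: us-east is down, the other two regions are up.
health = {"us-east": False, "eu-west": True, "ap-south": True}
print(route_active_passive("us-east", "eu-west", health))  # eu-west
print(route_active_active({"us-east": 20, "eu-west": 95, "ap-south": 180}, health))  # eu-west
```

In both cases the health signal drives the decision, so the quality of the health checks bounds the quality of the routing.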

7️⃣ Data Replication

Why Replication Matters

Stateless services are easy to run in multiple regions.

Data is the hard part.

Replication Types

Type	Description
Synchronous replication	Write waits for remote copy
Asynchronous replication	Write returns before remote copy
Multi-primary replication	Multiple regions accept writes
Single-primary replication	One region owns writes
Event-based replication	Changes replicate through logs/events

Core Problem

If data is not replicated correctly,
failover may lose or corrupt data.

👉 Interview Answer

In multi-region systems, data replication is usually the hardest problem.

The system must decide whether writes are synchronous, asynchronous, single-primary, or multi-primary.

This decision directly affects consistency, latency, and failover safety.
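
Example

A toy sketch of the difference between synchronous and asynchronous replication, assuming a single primary and one remote replica held in memory. The `Primary` and `Replica` classes are hypothetical stand-ins, not a real database client.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class Replica:
    """Hypothetical in-memory stand-in for a remote regional copy."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.pending: Deque[Tuple[str, str]] = deque()  # shipped but not yet applied

    def apply_pending(self) -> None:
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value


class Primary:
    def __init__(self, replica: Replica) -> None:
        self.data: Dict[str, str] = {}
        self.replica = replica

    def write_sync(self, key: str, value: str) -> None:
        """Acknowledge only after the remote copy has the write."""
        self.data[key] = value
        self.replica.data[key] = value

    def write_async(self, key: str, value: str) -> None:
        """Acknowledge immediately; the remote copy catches up later."""
        self.data[key] = value
        self.replica.pending.append((key, value))


replica = Replica()
primary = Primary(replica)
primary.write_sync("order:1", "paid")
primary.write_async("order:2", "paid")
print(sorted(replica.data))  # ['order:1']  (order:2 is lost if the primary dies now)
replica.apply_pending()
print(sorted(replica.data))  # ['order:1', 'order:2']
```

The synchronous path pays a cross-region round trip on every write; the asynchronous path is fast but leaves a window in which acknowledged writes exist only in the primary region.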

8️⃣ Active-passive Data Design

Common Pattern

Active-passive usually uses single-primary writes.

Region A database → Primary
Region B database → Replica

Replication

Usually asynchronous or semi-synchronous.

Write to primary
→ Replicate to standby

Failover

Primary fails
→ Promote replica
→ Route traffic to standby region

Risk

If replication lag exists, some recent writes may be lost.

👉 Interview Answer

Active-passive usually uses a primary database in the active region and a replicated standby database in the passive region.

On failure, the replica is promoted and traffic is routed to the standby region.

The main risk is replication lag and possible data loss.
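
Example

A small sketch of why replication lag translates into data loss at promotion time, assuming the primary keeps an ordered log of acknowledged writes and the replica has applied only a prefix of it. The function name and log format are assumptions made for illustration.

```python
from typing import List


def writes_lost_on_promotion(primary_log: List[str], replica_applied: int) -> List[str]:
    """Return the writes the primary acknowledged but the replica never applied.

    primary_log: ordered writes acknowledged by the primary.
    replica_applied: how many of those writes the replica had applied at failover.
    """
    return primary_log[replica_applied:]


log = ["w1", "w2", "w3", "w4", "w5"]
print(writes_lost_on_promotion(log, replica_applied=3))  # ['w4', 'w5']
```

Everything past the replica's applied position disappears when the replica is promoted, which is why replication lag has to be monitored against the RPO.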

9️⃣ Active-active Data Design

Common Pattern

Active-active may allow writes in multiple regions.

Region A accepts writes
Region B accepts writes
Region C accepts writes

Challenge

Concurrent writes may conflict.

Example Conflict

User updates profile in Region A.
Same user updates profile in Region B.
Both updates happen at nearly the same time.

Conflict Resolution Options

Last-write-wins
Version vectors
CRDTs
Region ownership
Application-level merge
Single-writer per entity

👉 Interview Answer

Active-active data design is difficult because multiple regions may accept writes at the same time.

This can create write conflicts.

The system needs a conflict-resolution strategy such as single-writer ownership, last-write-wins, CRDTs, or application-level merging.
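
Example

A minimal sketch of last-write-wins, the simplest of the strategies listed, assuming each write carries a wall-clock timestamp and the name of the region that accepted it. The `Versioned` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # wall-clock time of the write (assumes roughly synced clocks)
    region: str        # tie-breaker so every region picks the same winner


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the larger (timestamp, region) pair."""
    return a if (a.timestamp_ms, a.region) >= (b.timestamp_ms, b.region) else b


from_a = Versioned("Alice Smith", 1_000, "A")
from_b = Versioned("Alice S.", 1_005, "B")
print(lww_merge(from_a, from_b).value)  # Alice S.
print(lww_merge(from_b, from_a).value)  # Alice S.
```

The merge gives the same answer in either order, so all regions converge, but the losing update is silently discarded. That is acceptable for a display name and usually not acceptable for a balance.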

🔟 Consistency Trade-offs

Strong Consistency Across Regions

A write must be confirmed by multiple regions (typically a quorum) before it is acknowledged.

Pros

Correct global state
Fewer conflicts

Cons

Higher latency
Lower availability during network partitions

Eventual Consistency Across Regions

Write succeeds locally.
Other regions catch up later.

Pros

Low latency
Higher availability

Cons

Stale reads
Conflicts
More application complexity

👉 Interview Answer

Multi-region systems must trade off consistency, latency, and availability.

Strong cross-region consistency increases latency and can reduce availability.

Eventual consistency improves latency and availability, but the application must handle stale reads and conflicts.
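
Example

A rough sketch of the latency cost of cross-region confirmation, assuming acknowledgements arrive in round-trip-time order and the write completes once enough regions have answered. The round-trip numbers are made up for illustration.

```python
from typing import Dict


def quorum_write_latency_ms(rtt_ms: Dict[str, float], quorum: int) -> float:
    """Latency until `quorum` regions have acknowledged a write.

    rtt_ms maps each region to the round-trip time from the coordinator
    (the local region has an RTT near 0). Acks arrive in RTT order, so the
    write completes when the quorum-th fastest ack arrives.
    """
    if not 1 <= quorum <= len(rtt_ms):
        raise ValueError("quorum must be between 1 and the number of regions")
    return sorted(rtt_ms.values())[quorum - 1]


rtts = {"us-east": 1.0, "eu-west": 80.0, "ap-south": 210.0}  # hypothetical RTTs
print(quorum_write_latency_ms(rtts, quorum=1))  # 1.0   (local-only ack, eventual consistency)
print(quorum_write_latency_ms(rtts, quorum=2))  # 80.0  (majority of 3 regions)
print(quorum_write_latency_ms(rtts, quorum=3))  # 210.0 (all regions)
```

Moving from a local acknowledgement to a majority adds one inter-region round trip to every write, which is the latency price of strong cross-region consistency.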

1️⃣1️⃣ Failover Strategy

Failover Types

Manual Failover

Human operator decides.

Detect failure
→ Operator verifies
→ Promote standby
→ Route traffic

Automatic Failover

System detects failure and reroutes traffic.

Health check fails
→ Traffic shifts automatically

Trade-off

Automatic failover = faster but riskier

Manual failover = slower but safer

👉 Interview Answer

Failover strategy determines how the system moves traffic when a region fails.

Automatic failover is faster, but can cause split-brain if detection is wrong.

Manual failover is slower, but safer for complex stateful systems.
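
Example

A minimal sketch of one common way to make automatic failover less trigger-happy: require several consecutive failed health checks before shifting traffic. The threshold of 3 and the function name are assumptions, not a recommendation for any particular system.

```python
from typing import Iterable


def should_fail_over(probe_results: Iterable[bool], failure_threshold: int = 3) -> bool:
    """Return True once `failure_threshold` consecutive probes have failed.

    probe_results is the ordered history of health checks (True = healthy).
    A single failed probe does not trigger failover, which reduces the
    chance of failing over on a transient network blip.
    """
    consecutive_failures = 0
    for healthy in probe_results:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True
    return False


print(should_fail_over([True, False, True, False, False]))  # False
print(should_fail_over([True, False, False, False]))        # True
```

A higher threshold lowers the risk of a false failover but adds detection time, which counts against the RTO.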

1️⃣2️⃣ Split-brain Problem

What Is Split-brain?

Split-brain happens when two regions both believe they are primary.

Region A thinks it is primary.
Region B also thinks it is primary.
Both accept writes.
Data diverges.

Why Dangerous

It can cause:

Duplicate writes
Conflicting records
Data corruption
Inconsistent user state

Prevention

Leader election
Fencing tokens
Quorum writes
Strong health checks
Single-writer ownership
Manual promotion for critical systems

👉 Interview Answer

Split-brain is when multiple regions believe they are primary and accept conflicting writes.

It is one of the most dangerous multi-region failure modes.

Systems prevent it using leader election, quorum, fencing tokens, single-writer ownership, or controlled manual failover.
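
Example

A minimal sketch of fencing tokens, assuming some coordination service hands out a strictly increasing token each time a region is promoted and the storage layer checks that token on every write. `FencedStore` is a hypothetical class used only to show the check.

```python
class FencedStore:
    """Hypothetical storage layer that rejects writes from stale primaries."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means a newer primary exists.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
print(store.write(token=1, key="balance", value="100"))  # True  (Region A, token 1)
print(store.write(token=2, key="balance", value="90"))   # True  (Region B promoted, token 2)
print(store.write(token=1, key="balance", value="120"))  # False (Region A is stale)
print(store.data["balance"])                             # 90
```

Even if Region A still believes it is primary, its writes are rejected at the storage layer, so the data cannot diverge.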

1️⃣3️⃣ RPO and RTO

RPO: Recovery Point Objective

How much data loss is acceptable?

RPO = 5 minutes
→ Up to 5 minutes of data loss may be acceptable.

RTO: Recovery Time Objective

How long can the system be down?

RTO = 10 minutes
→ Service should recover within 10 minutes.

Why Important

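The architecture depends on RPO and RTO: a tighter RPO pushes toward synchronous replication, and a tighter RTO pushes toward automatic failover or active-active.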

👉 Interview Answer

RPO defines how much data loss is acceptable.

RTO defines how long the system can be unavailable.

Active-active usually targets lower RTO, while active-passive can be designed for lower cost but may have higher RTO.
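
Example

A back-of-the-envelope sketch for checking a design against the 5-minute RPO and 10-minute RTO used above. It assumes worst-case data loss is roughly the replication lag and recovery time is roughly detection plus promotion plus rerouting; the individual durations are made-up inputs.

```python
def meets_rpo(replication_lag_s: float, rpo_s: float) -> bool:
    """With asynchronous replication, worst-case data loss is roughly the replication lag."""
    return replication_lag_s <= rpo_s


def estimated_rto_s(detection_s: float, promotion_s: float, routing_s: float) -> float:
    """Recovery time is roughly detection + database promotion + traffic rerouting."""
    return detection_s + promotion_s + routing_s


rpo_s, rto_s = 5 * 60, 10 * 60  # the 5-minute RPO and 10-minute RTO from the example above
print(meets_rpo(replication_lag_s=45, rpo_s=rpo_s))  # True
recovery = estimated_rto_s(detection_s=90, promotion_s=120, routing_s=300)
print(recovery, recovery <= rto_s)                   # 510 True
```

Here rerouting dominates the recovery time, which is typical when failover relies on DNS and clients cache records until the TTL expires.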

1️⃣4️⃣ Read and Write Patterns

Read-heavy Systems

Active-active is easier here, because every region can serve reads while writes still go to one primary.

Writes → Primary region
Reads → Any regional replica

Write-heavy Systems

Active-active is harder here, because writes land in multiple regions.

Writes in multiple regions
→ More conflicts
→ Harder consistency

Good Pattern

Use regional read replicas, but keep writes single-primary when strong correctness matters.

👉 Interview Answer

Read-heavy systems are easier to make multi-region because reads can be served from replicas.

Write-heavy systems are harder because multi-region writes create consistency and conflict-resolution challenges.
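
Example

A minimal sketch of the "regional read replicas, single-primary writes" pattern, assuming the application picks a database endpoint per operation. The `RegionalRouter` class and the endpoint names are hypothetical.

```python
from typing import Dict


class RegionalRouter:
    """Writes go to the single primary; reads go to the caller's local replica."""

    def __init__(self, primary_region: str, endpoints: Dict[str, str]) -> None:
        self.primary_region = primary_region
        self.endpoints = endpoints  # region -> database endpoint (hypothetical names)

    def endpoint_for(self, operation: str, user_region: str) -> str:
        if operation == "write":
            return self.endpoints[self.primary_region]
        # Fall back to the primary if the user's region has no replica.
        return self.endpoints.get(user_region, self.endpoints[self.primary_region])


router = RegionalRouter(
    primary_region="us-east",
    endpoints={"us-east": "db.us-east.internal", "eu-west": "db.eu-west.internal"},
)
print(router.endpoint_for("read", "eu-west"))   # db.eu-west.internal
print(router.endpoint_for("write", "eu-west"))  # db.us-east.internal
print(router.endpoint_for("read", "ap-south"))  # db.us-east.internal
```

Reads stay local and fast, while writes pay the cross-region hop; a read that immediately follows a write may be stale if the local replica has not caught up.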

1️⃣5️⃣ When to Choose Active-passive

Choose Active-passive When

Simplicity matters
Strong consistency matters
Write conflicts are risky
Traffic is mostly in one region
DR is the main goal
Cost must be controlled
Failover delay is acceptable

Example

Banking ledger system
→ Active-passive or single-writer design

👉 Interview Answer

I would choose active-passive when the system needs disaster recovery, but not constant multi-region serving.

It is better when strong consistency, simpler operations, and lower cost matter more than global low latency.

1️⃣6️⃣ When to Choose Active-active

Choose Active-active When

Global low latency matters
Very high availability is required
Traffic is distributed globally
Read traffic is heavy
System can tolerate eventual consistency
Conflict resolution is manageable
Cost and complexity are acceptable

Example

Global social feed
→ Active-active can work well

👉 Interview Answer

I would choose active-active when global latency and availability are critical, and the application can handle eventual consistency or conflict resolution.

It is powerful, but it requires much more operational maturity.

1️⃣7️⃣ Common Failure Modes

Failure Modes

Multi-region systems can fail because of:

Replication lag
Split-brain
DNS failover delay
Stale reads
Data conflicts
Partial regional outage
Bad health checks
Dependency still single-region
Incomplete failover runbook

Example

Application fails over to Region B,
but payment dependency still only works in Region A.

👉 Interview Answer

Multi-region systems often fail because not every dependency is truly multi-region.

DNS, databases, queues, caches, third-party APIs, and operational runbooks all need to be designed for failover.
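
Example

A small sketch of a dependency audit that would catch the payment example above, assuming a hand-maintained inventory of which regions each dependency runs in. The inventory contents and the function name are illustrative.

```python
from typing import Dict, List, Set


def single_region_dependencies(deps: Dict[str, Set[str]], failover_region: str) -> List[str]:
    """Return dependencies that are not available in the failover region."""
    return sorted(name for name, regions in deps.items() if failover_region not in regions)


dependencies = {
    "app": {"region-a", "region-b"},
    "database": {"region-a", "region-b"},
    "payment-gateway": {"region-a"},  # the hidden single-region dependency
}
print(single_region_dependencies(dependencies, "region-b"))  # ['payment-gateway']
```

An inventory like this is only as good as its upkeep, which is one reason regular failover tests matter.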

1️⃣8️⃣ Observability

What to Monitor

Regional health
Replication lag
Error rate by region
Latency by region
Traffic distribution
Database failover status
Queue lag
Data conflict rate
DNS routing status
RPO / RTO compliance

Debugging Questions

Which region is serving traffic?
Is replication healthy?
Are reads stale?
Has failover actually completed?
Are dependencies available in the new region?

👉 Interview Answer

Multi-region observability must track regional health, traffic distribution, replication lag, failover state, dependency health, latency, error rates, and conflict rates.

Without observability, failover is hard to trust.
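
Example

A minimal sketch of per-region alerting on two of the signals listed above, error rate and replication lag. The thresholds (1% errors, 60 seconds of lag) and the input dictionaries are assumptions for illustration, not values from any monitoring product.

```python
from typing import Dict, List


def region_alerts(
    requests: Dict[str, int],
    errors: Dict[str, int],
    replication_lag_s: Dict[str, float],
    max_error_rate: float = 0.01,
    max_lag_s: float = 60.0,
) -> List[str]:
    """Return one alert string per threshold breach, grouped by region."""
    alerts = []
    for region in sorted(requests):
        total = requests[region]
        rate = errors.get(region, 0) / total if total else 0.0
        if rate > max_error_rate:
            alerts.append(f"{region}: error rate {rate:.1%}")
        if replication_lag_s.get(region, 0.0) > max_lag_s:
            alerts.append(f"{region}: replication lag {replication_lag_s[region]:.0f}s")
    return alerts


print(region_alerts(
    requests={"us-east": 10_000, "eu-west": 8_000},
    errors={"us-east": 20, "eu-west": 400},
    replication_lag_s={"us-east": 0.0, "eu-west": 95.0},
))
# ['eu-west: error rate 5.0%', 'eu-west: replication lag 95s']
```

Breaking every metric down by region is the key point: a global average can look healthy while one region is failing.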

1️⃣9️⃣ Best Practices

Practical Rules

Define RPO and RTO first
Keep writes single-primary if possible
Use active-active carefully for writes
Design traffic routing explicitly
Monitor replication lag
Avoid split-brain
Test failover regularly
Make dependencies multi-region
Document failover runbooks
Prefer simplicity unless business requires active-active

Design Principle

Multi-region is easy for stateless services.
It is hard for stateful data.

👉 Interview Answer

The best multi-region design starts with RPO, RTO, traffic patterns, and consistency requirements.

Stateless services are easier to run in multiple regions.

Stateful data replication is where most complexity lives.

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

Multi-region system design is about surviving regional failures, reducing global latency, and meeting disaster recovery requirements.

The main architectural choice is active-passive versus active-active.

In active-passive, one primary region serves live traffic while another region stays on standby.

If the primary region fails, traffic is failed over to the passive region, and the standby database may be promoted.

This design is simpler, cheaper, and easier to reason about, especially when strong consistency and write correctness matter.

The downside is that failover can take longer, and replication lag can cause data loss, so the lag must stay within the RPO.

In active-active, multiple regions serve production traffic at the same time.

Users can be routed to the closest healthy region, which improves latency and availability.

However, active-active is significantly more complex, especially for writes.

If multiple regions accept writes, the system must handle replication, conflict resolution, stale reads, and split-brain risks.

The hardest part of multi-region design is not running stateless services in multiple regions.

The hard part is stateful data: databases, queues, caches, and external dependencies.

For many systems, a good compromise is active-active stateless services with single-primary writes, plus regional read replicas.

This improves read latency and availability while avoiding multi-primary write conflicts.

The design should start with RPO and RTO.

RPO tells us how much data loss is acceptable.

RTO tells us how long the system can be unavailable.

These targets determine whether asynchronous replication, synchronous replication, manual failover, or automatic failover is appropriate.

Traffic routing can use DNS, global load balancers, geo-routing, latency-based routing, weighted routing, and health checks.

Failover must be tested regularly, because many systems fail during disaster recovery due to hidden single-region dependencies.

The core trade-off is: active-passive is simpler and cheaper, while active-active gives better latency and availability but much higher complexity.

The core principle is: multi-region is easy for stateless services, but hard for stateful data.

⭐ Final Insight

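The core of multi-region system design is not:

“Deploy to a few more regions.”

The hard parts are:

Traffic routing
Data replication
Consistency
Failover
RPO / RTO
Split-brain prevention
Dependency readiness
Observability

Active-passive is simpler and cheaper.

Active-active offers higher availability and lower latency, but is far more complex.

The most important takeaway: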

Multi-region is easy for stateless services.

It is hard for stateful data.
