System Design Deep Dive - 27. Design Collaborative Editing

Posted by ailswan, May 20, 2026


🎯 Design Collaborative Editing

1️⃣ Core Framework

When discussing Collaborative Editing design, I frame it as:

  1. Real-time synchronization
  2. Concurrent editing conflict resolution
  3. Operational transformation (OT) or CRDT
  4. Presence and cursor updates
  5. Persistence and version history
  6. Offline editing and reconnection
  7. Scalability and low latency
  8. Consistency vs responsiveness trade-offs

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

Collaborative editing systems allow multiple users to edit the same document simultaneously with near real-time synchronization.

The core challenge is resolving concurrent edits while keeping the system responsive, scalable, and eventually consistent.


3️⃣ Main APIs


Join Document Session

POST /api/document/join

Request:

{
  "documentId": "doc_123",
  "userId": "u456"
}

Send Edit Operation

POST /api/document/op

Request:

{
  "documentId": "doc_123",
  "operation": {
    "type": "insert",
    "position": 42,
    "text": "Hello"
  },
  "revision": 105
}
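
A minimal Python sketch of how a server might validate this request before passing it to the collaboration service. The `EditRequest` class, the `parse_edit_request` helper, and support for a `delete` type are assumptions for illustration; only the payload fields come from the example above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EditRequest:
    document_id: str
    op_type: str
    position: int
    text: str
    revision: int  # document revision the client based this edit on


def parse_edit_request(raw: str) -> EditRequest:
    """Parse and validate the JSON body of POST /api/document/op."""
    body = json.loads(raw)
    op = body["operation"]
    if op["type"] not in ("insert", "delete"):  # "delete" is assumed here
        raise ValueError(f"unsupported operation type: {op['type']!r}")
    for name, value in (("position", op["position"]), ("revision", body["revision"])):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    return EditRequest(
        document_id=body["documentId"],
        op_type=op["type"],
        position=op["position"],
        text=op.get("text", ""),
        revision=body["revision"],
    )


raw = '{"documentId": "doc_123", "operation": {"type": "insert", "position": 42, "text": "Hello"}, "revision": 105}'
request = parse_edit_request(raw)
print(request.op_type, request.position, request.revision)
```

The `revision` field matters later: it tells the server which operations the client had not yet seen when it produced the edit.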

Fetch Version History

GET /api/document/history?documentId=doc_123

👉 Interview Answer

The system exposes APIs for session joining, edit operation submission, and version history retrieval.

Real-time edits are typically delivered over WebSockets instead of polling.


4️⃣ High-Level Architecture


Client Editors
↔ WebSocket Gateway
↔ Collaboration Service
↔ OT / CRDT Engine
↔ Document Storage

Presence Service
↔ Cursor / User State

Operation Log
→ Version History / Replay

Main Components

WebSocket Gateway


Collaboration Service


OT / CRDT Engine


Storage Layer


👉 Interview Answer

I would design collaborative editing as a real-time bidirectional synchronization system.

Clients exchange operations through WebSockets, the collaboration service coordinates edits, and OT or CRDT logic resolves conflicts.


5️⃣ Real-time Synchronization


Why WebSockets?

Collaborative editing requires:


Edit Flow

User types
→ Local operation generated
→ Send operation via WebSocket
→ Server validates operation
→ Broadcast to collaborators
→ Clients apply operation

Optimistic UI

Clients apply edits locally before server acknowledgment.


👉 Interview Answer

The system should use WebSockets because edits must propagate in real time.

Clients usually apply edits optimistically for responsiveness, while the server establishes the canonical order of operations.
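
The optimistic pattern can be sketched roughly as follows. This is a simplified client model, not a full protocol: `OptimisticClient`, the `opId` field, and the ack shape are hypothetical names, and it handles inserts only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OptimisticClient:
    doc: str = ""
    revision: int = 0  # last server revision this client has seen
    pending: List[int] = field(default_factory=list)  # op ids awaiting an ack
    next_op_id: int = 1

    def local_insert(self, position: int, text: str) -> Dict:
        """Apply the edit locally right away, then return the message to send."""
        self.doc = self.doc[:position] + text + self.doc[position:]
        op_id = self.next_op_id
        self.next_op_id += 1
        self.pending.append(op_id)
        return {
            "opId": op_id,  # hypothetical field used to match acks
            "revision": self.revision,
            "operation": {"type": "insert", "position": position, "text": text},
        }

    def on_ack(self, op_id: int, server_revision: int) -> None:
        """The server accepted the op and assigned it a place in the order."""
        self.pending.remove(op_id)
        self.revision = server_revision


client = OptimisticClient(doc="Hello")
message = client.local_insert(5, " world")  # the user sees the text immediately
client.on_ack(message["opId"], server_revision=106)
print(client.doc, client.revision, client.pending)
```

Operations that stay in `pending` after a disconnect are the ones the client would need to resend.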


6️⃣ Concurrent Editing Problem


Example Conflict

User A:

Insert "Hello" at position 5

User B:

Delete character at position 3

Both happen simultaneously.


Challenges


Goals


👉 Interview Answer

Concurrent edits are the hardest part of collaborative editing.

The system must guarantee that all users eventually converge to the same document state.
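
To see why convergence does not come for free, here is a small sketch that applies the two example operations in both orders without any transformation. The ten-character base document is an assumption chosen only to make positions easy to read.

```python
def insert(doc: str, pos: int, text: str) -> str:
    return doc[:pos] + text + doc[pos:]


def delete(doc: str, pos: int) -> str:
    return doc[:pos] + doc[pos + 1:]


base = "0123456789"

# Replica 1 sees A's insert first, then B's delete.
replica_1 = delete(insert(base, 5, "Hello"), 3)

# Replica 2 sees B's delete first, then A's insert at the stale position.
replica_2 = insert(delete(base, 3), 5, "Hello")

print(replica_1)  # 0124Hello56789
print(replica_2)  # 01245Hello6789
```

The replicas disagree because position 5 no longer refers to the same place once the delete has shifted the text. OT and CRDTs are two ways of repairing exactly this.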


7️⃣ Operational Transformation (OT)


Core Idea

Transform incoming operations based on previously applied operations.


Example

Original document:

abcde

User A:

Insert X at position 2

User B:

Delete position 1

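OT adjusts positions dynamically. Assuming 0-based positions, once B's delete of "b" has been applied, A's insert must shift from position 2 to position 1 so that X still lands just before "c"; both replicas then converge on aXcde.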


Advantages


Challenges


👉 Interview Answer

OT transforms concurrent operations so edits remain consistent.

The server maintains operation ordering, and clients transform operations relative to already-applied edits.
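
A minimal sketch of a transform function for the example above, assuming 0-based positions, single-character deletes, and an explicit tie-break flag for inserts at the same position. The `Op`, `apply`, and `transform` names are illustrative; production OT systems handle many more cases.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str       # "insert" or "delete"
    pos: int
    text: str = ""  # inserted text; a delete removes one character


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # the op was cancelled by the transform
        return doc
    if op.kind == "insert":
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, other: Op, op_wins_ties: bool) -> Optional[Op]:
    """Rewrite `op` so it can be applied after the concurrent `other`."""
    if other.kind == "insert":
        if op.kind == "insert":
            before = op.pos < other.pos or (op.pos == other.pos and op_wins_ties)
        else:
            before = op.pos < other.pos
        return op if before else replace(op, pos=op.pos + len(other.text))
    # `other` deleted one character at other.pos
    if op.kind == "delete" and op.pos == other.pos:
        return None  # both users deleted the same character
    if op.pos > other.pos:
        return replace(op, pos=op.pos - 1)
    return op


a = Op("insert", 2, "X")  # User A
b = Op("delete", 1)       # User B

a_then_b = apply(apply("abcde", a), transform(b, a, op_wins_ties=False))
b_then_a = apply(apply("abcde", b), transform(a, b, op_wins_ties=True))
print(a_then_b, b_then_a)  # aXcde aXcde
```

Both application orders end in the same text, which is the property the transform has to preserve for every pair of operation types.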


8️⃣ CRDT (Conflict-free Replicated Data Type)


Core Idea

Every replica independently applies operations while mathematically guaranteeing convergence.


Characteristics


Common CRDT Types


Trade-offs

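Pros: replicas merge without central coordination, and offline support is better.

Cons: higher metadata overhead and higher memory usage.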


👉 Interview Answer

CRDTs allow distributed replicas to independently process edits while still converging automatically.

They work especially well for offline-first collaboration systems.
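
As a rough illustration, here is a small sequence CRDT in the style of RGA (replicated growable array). It is a sketch under stated assumptions: operations are delivered causally (an element arrives before anything that references it), ids are Lamport counters paired with a replica name, and deleted characters stay behind as tombstones. The `Replica` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Id = Tuple[int, str]  # (Lamport counter, replica name); compared lexicographically


@dataclass
class Elem:
    id: Id
    char: str
    deleted: bool = False  # tombstone: deleted characters stay in the list


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.clock = 0
        self.elems: List[Elem] = []

    def text(self) -> str:
        return "".join(e.char for e in self.elems if not e.deleted)

    def _index(self, id_: Id) -> int:
        return next(i for i, e in enumerate(self.elems) if e.id == id_)

    def insert_after(self, parent: Optional[Id], char: str) -> tuple:
        """Local insert after element `parent` (None means document start)."""
        self.clock += 1
        op = ("insert", (self.clock, self.name), parent, char)
        self.apply(op)
        return op

    def delete(self, target: Id) -> tuple:
        op = ("delete", target, None, "")
        self.apply(op)
        return op

    def apply(self, op: tuple) -> None:
        kind, id_, parent, char = op
        if kind == "delete":
            self.elems[self._index(id_)].deleted = True  # idempotent
            return
        if any(e.id == id_ for e in self.elems):
            return  # duplicate delivery is ignored
        self.clock = max(self.clock, id_[0])
        i = 0 if parent is None else self._index(parent) + 1
        # Skip concurrent siblings (and their descendants) with larger ids.
        while i < len(self.elems) and self.elems[i].id > id_:
            i += 1
        self.elems.insert(i, Elem(id_, char))


a, b = Replica("A"), Replica("B")
ops = [a.insert_after(None, "a")]
ops.append(a.insert_after(ops[-1][1], "b"))
ops.append(a.insert_after(ops[-1][1], "c"))
for op in ops:
    b.apply(op)  # both replicas now read "abc"

id_a, id_b = ops[0][1], ops[1][1]
from_a = [a.insert_after(id_a, "X")]                  # A now reads "aXbc"
from_b = [b.delete(id_b), b.insert_after(id_a, "Y")]  # B now reads "aYc"
for op in from_b:
    a.apply(op)
for op in from_a:
    b.apply(op)
print(a.text(), b.text())  # aYXc aYXc
```

No server decided the order here: each replica applied the other's operations as they arrived, and the id comparison alone settled where the concurrent inserts go. The tombstones and per-character ids are the metadata overhead that CRDTs pay for this.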


9️⃣ OT vs CRDT


| Aspect | OT | CRDT |
|---|---|---|
| Centralized coordination | Usually yes | Not required |
| Offline support | Harder | Better |
| Metadata overhead | Lower | Higher |
| Implementation complexity | High | High |
| Memory usage | Lower | Higher |
| Mature production usage | Very common | Growing rapidly |

Common Industry Choices

Google Docs → OT
Figma → Hybrid (CRDT-inspired, coordinated by a central server)
Modern offline editors → CRDT

👉 Interview Answer

OT is common in centralized systems like Google Docs, while CRDTs are increasingly popular for offline-first collaborative editing.

The trade-off is usually OT's reliance on central coordination and intricate transform logic versus the higher metadata and memory overhead of CRDTs.


🔟 Presence and Cursor Tracking


Presence Features


Presence Flow

Client heartbeat
→ Presence service updates state
→ Broadcast cursor positions

Optimization


👉 Interview Answer

Presence data is transient and should not go through durable storage.

I would use lightweight in-memory systems for cursor tracking and online presence.
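
A minimal in-memory sketch of the heartbeat flow, assuming a time-to-live after which a silent user is treated as gone. `PresenceTracker` and its methods are hypothetical names; a shared in-memory store would play this role across multiple nodes.

```python
import time
from typing import Dict, Optional, Tuple


class PresenceTracker:
    """Keeps cursor positions in memory and forgets users who stop sending heartbeats."""

    def __init__(self, ttl_seconds: float = 10.0) -> None:
        self.ttl = ttl_seconds
        # (document_id, user_id) -> (last_seen, cursor_position)
        self._state: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def heartbeat(self, document_id: str, user_id: str, cursor: int,
                  now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._state[(document_id, user_id)] = (now, cursor)

    def active(self, document_id: str, now: Optional[float] = None) -> Dict[str, int]:
        """Return user -> cursor for users seen within the TTL."""
        now = time.monotonic() if now is None else now
        self._state = {k: v for k, v in self._state.items() if now - v[0] <= self.ttl}
        return {user: cursor
                for (doc, user), (_, cursor) in self._state.items()
                if doc == document_id}


presence = PresenceTracker(ttl_seconds=10)
presence.heartbeat("doc_123", "u456", cursor=42, now=0.0)
presence.heartbeat("doc_123", "u789", cursor=7, now=8.0)
print(presence.active("doc_123", now=9.0))   # both users are present
print(presence.active("doc_123", now=15.0))  # u456 has expired
```

Nothing here is written to disk, so a restart simply loses presence state and clients repopulate it with their next heartbeat.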


1️⃣1️⃣ Persistence and Version History


What to Persist


Snapshot Strategy

Periodic snapshot
+ incremental operation log

Why Keep Operation Logs?


Example

Snapshot every 100 operations

👉 Interview Answer

I would persist both document snapshots and operation logs.

Snapshots speed up recovery, while operation logs support replay, undo, and version history.
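
The snapshot-plus-log idea can be sketched as below, using the "snapshot every 100 operations" example. `DocumentStore` is a hypothetical in-memory stand-in for durable storage, and operations are simplified to single inserts and single-character deletes.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, int, str]  # ("insert", pos, text) or ("delete", pos, "")


def apply_op(doc: str, op: Op) -> str:
    kind, pos, text = op
    if kind == "insert":
        return doc[:pos] + text + doc[pos:]
    return doc[:pos] + doc[pos + 1:]


class DocumentStore:
    def __init__(self, snapshot_every: int = 100) -> None:
        self.snapshot_every = snapshot_every
        self.log: List[Op] = []                  # log[i] produces revision i + 1
        self.snapshots: Dict[int, str] = {0: ""}  # revision -> full text
        self._head = ""

    def append(self, op: Op) -> int:
        """Append an op to the log and snapshot at every Nth revision."""
        self._head = apply_op(self._head, op)
        self.log.append(op)
        revision = len(self.log)
        if revision % self.snapshot_every == 0:
            self.snapshots[revision] = self._head
        return revision

    def load(self, revision: int) -> str:
        """Rebuild any revision from the nearest earlier snapshot plus replay."""
        base = max(r for r in self.snapshots if r <= revision)
        doc = self.snapshots[base]
        for op in self.log[base:revision]:
            doc = apply_op(doc, op)
        return doc


store = DocumentStore(snapshot_every=100)
for i in range(250):
    store.append(("insert", i, "x"))
print(sorted(store.snapshots))  # [0, 100, 200]
print(len(store.load(150)))     # 150
```

Loading revision 150 replays only 50 operations on top of the snapshot at revision 100, which is the recovery speed-up the snapshots buy.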


1️⃣2️⃣ Offline Editing


Offline Flow

Client disconnected
→ Local edits queued
→ Reconnect
→ Sync pending operations
→ Resolve conflicts

Challenges


Why CRDT Helps

Because replicas merge automatically.


👉 Interview Answer

Offline editing is difficult because clients may diverge significantly.

CRDTs simplify offline synchronization, while OT-based systems usually require stronger coordination.
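
The queueing half of the offline flow can be sketched as follows. This covers only buffering and in-order flush on reconnect; conflict resolution is assumed to happen downstream in the OT or CRDT layer. `OfflineQueue` and the `send` callback are hypothetical.

```python
from collections import deque
from typing import Any, Callable, Deque


class OfflineQueue:
    """Buffers operations while disconnected and flushes them in order on reconnect."""

    def __init__(self, send: Callable[[Any], None]) -> None:
        self.send = send
        self.online = True
        self.pending: Deque[Any] = deque()

    def submit(self, op: Any) -> None:
        if self.online:
            self.send(op)
        else:
            self.pending.append(op)  # keep local edits until the link returns

    def set_online(self, online: bool) -> None:
        self.online = online
        while self.online and self.pending:
            self.send(self.pending.popleft())  # preserve the original order


sent = []
queue = OfflineQueue(send=sent.append)
queue.submit("op-1")
queue.set_online(False)
queue.submit("op-2")
queue.submit("op-3")
print(sent)  # ['op-1']
queue.set_online(True)
print(sent)  # ['op-1', 'op-2', 'op-3']
```

The longer the client stays offline, the more the flushed operations refer to a stale document, which is why the merge step after the flush is the hard part.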


1️⃣3️⃣ Scaling Patterns


Pattern 1: Document Sharding

Partition by document ID.


Pattern 2: Sticky Sessions

Keep document users on same collaboration node.


Pattern 3: Pub/Sub Fanout

Broadcast operations efficiently.


Pattern 4: Snapshot Compaction

Reduce replay overhead.


Pattern 5: Delta Sync

Only send incremental changes.


👉 Interview Answer

To scale collaborative editing, I would shard by document, use sticky sessions for active collaborators, and broadcast incremental operations through pub/sub.
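
A small sketch of routing by document ID so that all collaborators on one document reach the same node. The node names are made up, and plain modulo hashing is an assumption chosen for brevity.

```python
import hashlib
from typing import List


def node_for_document(document_id: str, nodes: List[str]) -> str:
    """Map a document to one collaboration node with a stable hash."""
    # hashlib is used because Python's built-in hash() of a str varies per process.
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["collab-0", "collab-1", "collab-2"]  # hypothetical node names
print(node_for_document("doc_123", nodes))
```

Plain modulo hashing remaps most documents when the node count changes; consistent hashing is a common alternative when nodes are added or removed often.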


1️⃣4️⃣ Failure Handling


Common Failures


Strategies


👉 Interview Answer

Real-time collaboration systems must tolerate disconnects and retries.

I would use sequence numbers, idempotent operations, and replay mechanisms to recover session state.
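
One way to make retries safe is a per-client sequence number checked on the server. The sketch below assumes each client numbers its operations from 1; `SessionInbox` and the returned status strings are hypothetical.

```python
from typing import Any, Dict, List


class SessionInbox:
    """Accepts each client's operations exactly once and in order."""

    def __init__(self) -> None:
        self.last_seq: Dict[str, int] = {}  # client_id -> highest seq applied
        self.applied: List[Any] = []

    def receive(self, client_id: str, seq: int, op: Any) -> str:
        expected = self.last_seq.get(client_id, 0) + 1
        if seq < expected:
            return "duplicate"  # a retry of something already applied
        if seq > expected:
            return "gap"        # something was lost; ask the client to resend
        self.applied.append(op)
        self.last_seq[client_id] = seq
        return "applied"


inbox = SessionInbox()
print(inbox.receive("u456", 1, "op-a"))  # applied
print(inbox.receive("u456", 1, "op-a"))  # duplicate
print(inbox.receive("u456", 3, "op-c"))  # gap
print(inbox.receive("u456", 2, "op-b"))  # applied
```

Because a duplicate is acknowledged without being applied again, the client can retry after a timeout without risking a double insert.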


1️⃣5️⃣ Consistency Model


Strong Consistency Needed For


Eventual Consistency Acceptable For


Local-first UX

Fast local responsiveness
→ eventual synchronization

👉 Interview Answer

Collaborative editing systems prioritize responsiveness.

Temporary divergence is acceptable, as long as all clients eventually converge to the same document state.


1️⃣6️⃣ Observability


Key Metrics


Alerts


👉 Interview Answer

I would monitor edit latency, synchronization failures, reconnect frequency, operation throughput, and conflict resolution metrics.

These metrics show whether collaboration feels real time and reliable.


1️⃣7️⃣ End-to-End Flow


Real-time Editing Flow

User types
→ Local operation generated
→ WebSocket sends operation
→ Collaboration server sequences operation
→ OT/CRDT resolves conflicts
→ Broadcast updates
→ Clients apply operations
→ Snapshot and logs persisted

Offline Recovery Flow

Client reconnects
→ Sends pending operations
→ Server merges operations
→ Missing updates replayed
→ Client converges

Key Insight

Collaborative editing is fundamentally a distributed synchronization and conflict resolution system.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a collaborative editing system, I think of it as a real-time distributed synchronization platform.

Clients maintain persistent WebSocket connections to collaboration servers, which coordinate edits across users.

Users generate local edit operations, and those operations are propagated in real time to other collaborators.

The core challenge is concurrent editing.

Multiple users may modify overlapping regions of the document simultaneously, so the system needs a conflict resolution strategy such as Operational Transformation or CRDTs.

OT transforms operations relative to previously applied edits, while CRDTs guarantee convergence mathematically across distributed replicas.

The system should optimize for responsiveness, so clients usually apply edits optimistically before server acknowledgment.

Presence information such as cursors and typing indicators should be handled separately from durable document storage, because they are transient.

I would persist document snapshots along with operation logs.

Snapshots improve recovery speed, while logs support replay, undo/redo, audit history, and debugging.

For scalability, I would shard by document ID, use sticky collaboration sessions, and broadcast incremental operations through pub/sub systems.

The main trade-offs are consistency, responsiveness, metadata overhead, offline support, and implementation complexity.

Ultimately, collaborative editing is a distributed conflict resolution system that prioritizes low-latency user experience while guaranteeing eventual convergence.


⭐ Final Insight

The core of collaborative editing is not "multiple people writing a document at the same time" but a distributed synchronization system built from real-time synchronization, OT/CRDT conflict resolution, operation propagation, version history, and eventual convergence.



Chinese Version


🎯 Design Collaborative Editing


1️⃣ Core Framework

When designing Collaborative Editing, I usually analyze it along these dimensions:

  1. Real-time synchronization
  2. Concurrent editing conflict resolution
  3. OT or CRDT
  4. Presence and cursor updates
  5. Persistence and version history
  6. Offline editing
  7. Scalability and low latency
  8. Consistency vs responsiveness

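2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

Collaborative editing lets multiple users edit the same document simultaneously, with changes synchronized in real time.

The core challenge is resolving concurrent edits while keeping latency low and the document eventually consistent.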


3️⃣ Main APIs


Join Session

POST /api/document/join

Send Edit Operation

POST /api/document/op

Fetch History

GET /api/document/history

👉 Interview Answer

The system provides APIs for joining a session, submitting operations, and retrieving history.

Real-time edits are usually transmitted over WebSocket.


4️⃣ High-Level Architecture


Client Editors
↔ WebSocket Gateway
↔ Collaboration Service
↔ OT / CRDT Engine
↔ Document Storage

Main Components

WebSocket Gateway


Collaboration Service


OT / CRDT Engine


Storage Layer


👉 Interview Answer

I would design collaborative editing as a real-time bidirectional sync system.

Clients exchange operations over WebSocket, and the OT or CRDT engine handles conflict resolution.


5️⃣ Real-time Synchronization


Why WebSocket?

Because collaboration requires:


Edit Flow

User types
→ Local operation generated
→ Send operation
→ Broadcast to collaborators
→ Clients apply operation

Optimistic UI

Clients apply edits locally first.


👉 Interview Answer

To stay responsive, clients usually apply edits optimistically, and the server then coordinates the final order.


6️⃣ Concurrent Editing Problem


Example

User A:

Insert "Hello"

User B:

Delete nearby text

Both happen at the same time.


Challenges


Goals


👉 Interview Answer

Concurrent editing is the hardest part of collaborative editing.

The system must guarantee that all users eventually converge to the same document state.


7️⃣ Operational Transformation (OT)


Core Idea

Dynamically transform new operations based on the operations already applied.


Advantages


Challenges


👉 Interview Answer

OT dynamically transforms concurrent operations to keep the document consistent.


8️⃣ CRDT


Core Idea

Multiple replicas apply operations independently and are still guaranteed to converge.


Pros


Cons


👉 Interview Answer

CRDTs are a good fit for offline-first systems because replicas can merge changes independently.


9️⃣ OT vs CRDT


| Aspect | OT | CRDT |
|---|---|---|
| Offline support | Harder | Better |
| Metadata overhead | Lower | Higher |
| Memory usage | Lower | Higher |
| Central coordination | Usually yes | Not required |

👉 Interview Answer

Google Docs leans toward OT, while modern offline-first editors increasingly use CRDTs.


🔟 Presence and Cursor Tracking


Presence Features


Optimization


👉 Interview Answer

Presence data is transient and does not need durable persistence.


1️⃣1️⃣ Persistence and History


What to Persist


Snapshot Strategy

Periodic snapshots + operation log

👉 Interview Answer

Snapshots speed up recovery, while operation logs support replay, undo/redo, and version history.


1️⃣2️⃣ Offline Editing


Flow

Offline edits
→ Local queue
→ Reconnect
→ Sync operations

Challenges


👉 Interview Answer

Offline editing is hard because clients may diverge for a long time.

CRDTs are better suited to offline sync.


1️⃣3️⃣ Scaling Patterns


Common Strategies


👉 Interview Answer

To scale, I would shard by document and broadcast operations through pub/sub.


1️⃣4️⃣ Failure Handling


Common Failures


Recovery


👉 Interview Answer

The system must tolerate disconnects and retries.

Sequence numbers and replay mechanisms are essential.


1️⃣5️⃣ Consistency Model


Strong Consistency Needed


Eventual Consistency Acceptable


👉 Interview Answer

Collaborative editing prioritizes responsiveness.

Temporary divergence is acceptable as long as clients eventually converge.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a collaborative editing system, I treat it as a real-time distributed synchronization platform.

Clients establish persistent connections to collaboration servers over WebSocket.

Users generate local edit operations, and the system broadcasts them to other collaborators in real time.

The core difficulty is concurrent editing.

Multiple users may modify the same region at the same time, so the system needs OT or CRDT for conflict resolution.

OT transforms operations, while CRDTs guarantee mathematically that replicas converge.

For responsiveness, clients usually apply edits optimistically.

Presence information such as cursors and typing indicators is transient state and should be kept separate from durable storage.

I would persist snapshots and operation logs.

Snapshots speed up recovery, while logs support replay, undo/redo, audit history, and debugging.

To scale, I would shard by document, use sticky collaboration sessions, and broadcast incremental operations through pub/sub.

The core trade-offs include consistency, responsiveness, metadata overhead, offline support, and implementation complexity.

Ultimately, collaborative editing is a distributed conflict resolution system whose goal is low-latency real-time collaboration with guaranteed eventual convergence.


⭐ Final Insight

The core of collaborative editing is not "multiple people editing at the same time" but a distributed synchronization system built from real-time synchronization, OT/CRDT conflict resolution, operation propagation, version history, and eventual convergence.

Implement