🎯 Design Collaborative Editing
1️⃣ Core Framework
When discussing Collaborative Editing design, I frame it as:
- Real-time synchronization
- Concurrent editing conflict resolution
- Operational transformation (OT) or CRDT
- Presence and cursor updates
- Persistence and version history
- Offline editing and reconnection
- Scalability and low latency
- Consistency vs responsiveness trade-offs
2️⃣ Core Requirements
Functional Requirements
- Multiple users edit the same document simultaneously
- Real-time synchronization
- Show live cursors and presence
- Persist document changes
- Support undo/redo
- Support offline editing
- Support version history
- Resolve concurrent edits correctly
Non-functional Requirements
- Low edit latency
- High availability
- Strong perceived responsiveness
- Eventual consistency acceptable
- Fault tolerance
- Horizontal scalability
- Durable document storage
- Efficient bandwidth usage
👉 Interview Answer
Collaborative editing systems allow multiple users to edit the same document simultaneously with near real-time synchronization.
The core challenge is resolving concurrent edits while keeping the system responsive, scalable, and eventually consistent.
3️⃣ Main APIs
Join Document Session
POST /api/document/join
Request:
{
  "documentId": "doc_123",
  "userId": "u456"
}
Send Edit Operation
POST /api/document/op
Request:
{
  "documentId": "doc_123",
  "operation": {
    "type": "insert",
    "position": 42,
    "text": "Hello"
  },
  "revision": 105
}
Fetch Version History
GET /api/document/history?documentId=doc_123
👉 Interview Answer
The system exposes APIs for session joining, edit operation submission, and version history retrieval.
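Real-time edits are typically sent and received over a WebSocket connection rather than through per-edit HTTP requests or polling, so the edit endpoint above mainly documents the message shape.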
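A minimal sketch of how a server might parse and validate the edit message shown above. The `EditOperation` class, the `parse_edit_message` function, and the `"delete"` operation type are illustrative assumptions; only the JSON field names come from the request example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EditOperation:
    document_id: str
    op_type: str   # "insert" or "delete" ("delete" is an assumed second type)
    position: int
    text: str      # inserted text; empty for deletes
    revision: int  # document revision the client based this edit on


def parse_edit_message(raw: str) -> EditOperation:
    """Parse one JSON edit message; raise ValueError if it is malformed."""
    try:
        body = json.loads(raw)
        op = body["operation"]
        document_id = body["documentId"]
        op_type, position = op["type"], op["position"]
        revision = body["revision"]
        text = op.get("text", "")
    except (KeyError, TypeError, AttributeError) as exc:
        raise ValueError(f"malformed edit message: {exc!r}") from exc
    if op_type not in ("insert", "delete"):
        raise ValueError(f"unsupported operation type: {op_type!r}")
    if not isinstance(position, int) or not isinstance(revision, int):
        raise ValueError("position and revision must be integers")
    if position < 0 or revision < 0:
        raise ValueError("position and revision must be non-negative")
    return EditOperation(document_id, op_type, position, text, revision)
```

The `revision` field matters later: it tells the server which document state the client saw when it produced the edit.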
4️⃣ High-Level Architecture
Client Editors
↔ WebSocket Gateway
↔ Collaboration Service
↔ OT / CRDT Engine
↔ Document Storage
Presence Service
↔ Cursor / User State
Operation Log
→ Version History / Replay
Main Components
WebSocket Gateway
- Maintains persistent connections
- Pushes real-time updates
- Handles reconnection
Collaboration Service
- Coordinates sessions
- Broadcasts operations
- Tracks document state
OT / CRDT Engine
- Resolves concurrent edits
- Preserves convergence
- Maintains consistency
Storage Layer
- Stores document snapshots
- Stores operation logs
- Supports history replay
👉 Interview Answer
I would design collaborative editing as a real-time bidirectional synchronization system.
Clients exchange operations through WebSockets, the collaboration service coordinates edits, and OT or CRDT logic resolves conflicts.
5️⃣ Real-time Synchronization
Why WebSockets?
Collaborative editing requires:
- Low latency
- Bidirectional communication
- Persistent connection
Edit Flow
User types
→ Local operation generated
→ Send operation via WebSocket
→ Server validates operation
→ Broadcast to collaborators
→ Clients apply operation
Optimistic UI
Clients apply edits locally before server acknowledgment.
👉 Interview Answer
The system should use WebSockets because edits must propagate in real time.
Clients usually apply edits optimistically for responsiveness, while the server establishes the canonical order of operations.
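The edit flow above can be sketched as an in-memory session that validates, sequences, and broadcasts operations. This is a sketch under assumptions: `DocumentSession`, its method names, and the callback signature are hypothetical, and plain callbacks stand in for real WebSocket connections.

```python
from typing import Callable, Dict, List, Tuple

UpdateCallback = Callable[[int, dict], None]


class DocumentSession:
    """Hypothetical per-document session: assigns revisions and fans out ops."""

    def __init__(self) -> None:
        self.revision = 0
        self.log: List[Tuple[int, dict]] = []
        self.subscribers: Dict[str, UpdateCallback] = {}

    def join(self, client_id: str, on_update: UpdateCallback) -> int:
        """Register a client and return the revision it should start from."""
        self.subscribers[client_id] = on_update
        return self.revision

    def submit(self, client_id: str, op: dict) -> int:
        """Validate, sequence, and broadcast one op; return the acked revision."""
        if client_id not in self.subscribers:
            raise PermissionError("client has not joined this session")
        if op.get("type") not in ("insert", "delete"):
            raise ValueError("unsupported operation")
        self.revision += 1  # the server decides the canonical order
        self.log.append((self.revision, op))
        for other_id, callback in self.subscribers.items():
            if other_id != client_id:  # the sender already applied it locally
                callback(self.revision, op)
        return self.revision
```

The sender is skipped during broadcast because, with an optimistic UI, it has already applied the edit and only needs the acknowledgment.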
6️⃣ Concurrent Editing Problem
Example Conflict
User A:
Insert "Hello" at position 5
User B:
Delete character at position 3
Both happen simultaneously.
Challenges
- Position shifts
- Ordering ambiguity
- Network delay
- Out-of-order delivery
Goals
- Convergence
- Intention preservation
- Consistency
👉 Interview Answer
Concurrent edits are the hardest part of collaborative editing.
The system must guarantee that all users eventually converge to the same document state.
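To make the position-shift problem concrete, here is a small sketch that applies the two example operations in both orders without any conflict resolution. The ten-character starting document and the helper names are assumptions chosen for illustration; positions are 0-based.

```python
def insert(doc: str, pos: int, text: str) -> str:
    """Insert text before the character at pos."""
    return doc[:pos] + text + doc[pos:]


def delete(doc: str, pos: int) -> str:
    """Delete the single character at pos."""
    return doc[:pos] + doc[pos + 1:]


doc = "0123456789"

# Replica A applies its own insert first, then receives B's delete unchanged.
replica_a = delete(insert(doc, 5, "Hello"), 3)

# Replica B applies its own delete first, then receives A's insert unchanged.
replica_b = insert(delete(doc, 3), 5, "Hello")

print(replica_a)  # 0124Hello56789
print(replica_b)  # 01245Hello6789
```

The two replicas end with different text because B's delete shifted every later position by one, which is exactly what OT or a CRDT has to account for.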
7️⃣ Operational Transformation (OT)
Core Idea
Transform incoming operations based on previously applied operations.
Example
Original document:
abcde
User A:
Insert X at position 2
User B:
Delete position 1
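OT adjusts positions so that both replicas converge. Assuming 0-based positions, A's insert alone gives abXcde and B's delete alone gives acde. When A's insert arrives at B, it shifts from position 2 to 1 because a character before it was removed; B's delete stays at position 1 when it arrives at A because it precedes the insert. Both replicas end at aXcde.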
Advantages
- Mature approach
- Used in Google Docs
- Bandwidth-efficient (operations carry little metadata)
Challenges
- Complex transformation rules
- Difficult edge cases
- Hard to implement correctly
👉 Interview Answer
OT transforms concurrent operations so edits remain consistent.
The server maintains operation ordering, and clients transform operations relative to already-applied edits.
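A minimal sketch of the transformation idea for two concurrent single-character-delete and insert operations. The `Insert`, `Delete`, `transform`, and `converge` names are assumptions, positions are 0-based, and the sketch only handles one pair of concurrent operations; production OT also has to transform against whole operation histories.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # breaks ties between concurrent inserts at the same position


@dataclass(frozen=True)
class Delete:
    pos: int  # deletes exactly one character


Op = Union[Insert, Delete]


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # the op was cancelled by the transform
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, applied: Op) -> Optional[Op]:
    """Rewrite op so it can run after the concurrent op `applied`."""
    if isinstance(applied, Insert):
        shift = len(applied.text)
        if isinstance(op, Insert):
            first = op.pos < applied.pos or (
                op.pos == applied.pos and op.site < applied.site)
            return op if first else replace(op, pos=op.pos + shift)
        return op if op.pos < applied.pos else replace(op, pos=op.pos + shift)
    if isinstance(op, Insert):
        return op if op.pos <= applied.pos else replace(op, pos=op.pos - 1)
    if op.pos == applied.pos:
        return None  # both sites deleted the same character
    return op if op.pos < applied.pos else replace(op, pos=op.pos - 1)


def converge(doc: str, a: Op, b: Op) -> Tuple[str, str]:
    """Apply a then transformed b, and b then transformed a."""
    site_a = apply(apply(doc, a), transform(b, a))
    site_b = apply(apply(doc, b), transform(a, b))
    return site_a, site_b


print(converge("abcde", Insert(2, "X", "A"), Delete(1)))  # ('aXcde', 'aXcde')
```

Both application orders yield the same text, which is the convergence property the transform rules exist to provide.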
8️⃣ CRDT (Conflict-free Replicated Data Type)
Core Idea
Each replica applies operations independently, and the data structure's merge rules guarantee that replicas which have seen the same set of operations converge to the same state.
Characteristics
- No central transformation required
- Naturally supports offline editing
- Strong eventual consistency
Common CRDT Algorithms and Libraries
- Sequence CRDTs (for ordered text)
- RGA (algorithm)
- LSEQ (algorithm)
- Yjs (library)
- Automerge (library)
Trade-offs
Pros:
- Better offline support
- Easier peer-to-peer collaboration
Cons:
- Higher metadata overhead
- More memory usage
- More complex storage
👉 Interview Answer
CRDTs allow distributed replicas to independently process edits while still converging automatically.
They work especially well for offline-first collaboration systems.
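A toy sequence CRDT, loosely in the spirit of position-identifier designs such as LSEQ, can illustrate the convergence guarantee. Everything here is an assumption made for illustration: the `SequenceCRDT` class, fractional positions between 0 and 1, and full-state merging. Concurrent inserts into the same gap are ordered by site ID, and real libraries handle identifier growth and interleaving far more carefully.

```python
from fractions import Fraction
from typing import Dict, Set, Tuple

ElementId = Tuple[Fraction, str, int]  # (position, site_id, per-site counter)


class SequenceCRDT:
    def __init__(self, site_id: str) -> None:
        self.site_id = site_id
        self.counter = 0
        self.elements: Dict[ElementId, str] = {}
        self.tombstones: Set[ElementId] = set()  # deleted ids, never removed

    def _visible_ids(self):
        return sorted(i for i in self.elements if i not in self.tombstones)

    def text(self) -> str:
        return "".join(self.elements[i] for i in self._visible_ids())

    def insert(self, index: int, char: str) -> ElementId:
        ids = self._visible_ids()
        left = ids[index - 1][0] if index > 0 else Fraction(0)
        right = ids[index][0] if index < len(ids) else Fraction(1)
        self.counter += 1
        new_id = ((left + right) / 2, self.site_id, self.counter)
        self.elements[new_id] = char
        return new_id

    def delete(self, index: int) -> ElementId:
        target = self._visible_ids()[index]
        self.tombstones.add(target)
        return target

    def merge(self, other: "SequenceCRDT") -> None:
        """Union of both states: commutative, associative, and idempotent."""
        self.elements.update(other.elements)
        self.tombstones |= other.tombstones


a = SequenceCRDT("A")
for i, ch in enumerate("abcde"):
    a.insert(i, ch)
b = SequenceCRDT("B")
b.merge(a)          # both replicas start from the same document

a.insert(2, "X")    # replica A edits on its own
b.delete(1)         # replica B edits on its own
a.merge(b)
b.merge(a)
print(a.text(), b.text())  # aXcde aXcde
```

Because merging is a set union over uniquely identified elements, the order and number of merges do not affect the result. The per-element identifiers and tombstones are also where the metadata and memory overhead listed above comes from.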
9️⃣ OT vs CRDT
| Aspect | OT | CRDT |
|---|---|---|
| Centralized coordination | Usually yes | Not required |
| Offline support | Harder | Better |
| Metadata overhead | Lower | Higher |
| Implementation complexity | High | High |
| Memory usage | Lower | Higher |
| Mature production usage | Very common | Growing rapidly |
Common Industry Choices
Google Docs → OT
Figma → CRDT-inspired approach coordinated by a central server
Modern offline editors → CRDT
👉 Interview Answer
OT is common in centralized systems like Google Docs, while CRDTs are increasingly popular for offline-first collaborative editing.
The trade-off is usually where the cost lands: OT pays in transformation-logic complexity and reliance on a central server, while CRDTs pay in metadata and memory overhead.
🔟 Presence and Cursor Tracking
Presence Features
- Online users
- Live cursors
- User selections
- Typing indicators
Presence Flow
Client heartbeat
→ Presence service updates state
→ Broadcast cursor positions
Optimization
- Throttle cursor updates
- Compress updates
- Use ephemeral storage
👉 Interview Answer
Presence data is transient and should not go through durable storage.
I would use lightweight in-memory systems for cursor tracking and online presence.
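A minimal sketch of in-memory presence with heartbeat expiry and throttled cursor broadcasts. The `PresenceTracker` class, the 10-second TTL, and the 50 ms broadcast interval are illustrative assumptions, not recommended values.

```python
import time
from typing import Dict, Optional, Tuple


class PresenceTracker:
    """Ephemeral presence state: nothing here is written to durable storage."""

    def __init__(self, ttl_seconds: float = 10.0, min_interval: float = 0.05) -> None:
        self.ttl_seconds = ttl_seconds
        self.min_interval = min_interval
        self._state: Dict[str, Tuple[int, float]] = {}  # user -> (cursor, last seen)
        self._last_broadcast: Dict[str, float] = {}

    def update(self, user_id: str, cursor: int, now: Optional[float] = None) -> bool:
        """Record a heartbeat; return True if the cursor should be broadcast."""
        now = time.monotonic() if now is None else now
        self._state[user_id] = (cursor, now)
        last = self._last_broadcast.get(user_id)
        if last is not None and now - last < self.min_interval:
            return False  # throttled: too soon after the previous broadcast
        self._last_broadcast[user_id] = now
        return True

    def online(self, now: Optional[float] = None) -> Dict[str, int]:
        """Return cursors of users whose last heartbeat is within the TTL."""
        now = time.monotonic() if now is None else now
        return {user: cursor
                for user, (cursor, seen) in self._state.items()
                if now - seen <= self.ttl_seconds}
```

Losing an occasional cursor update is acceptable here, because the next heartbeat overwrites the state anyway.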
1️⃣1️⃣ Persistence and Version History
What to Persist
- Current document snapshot
- Operation log
- User actions
- Timestamps
- Version checkpoints
Snapshot Strategy
Periodic snapshot
+ incremental operation log
Why Keep Operation Logs?
- Undo/redo
- Replay
- Audit history
- Recovery
- Conflict debugging
Example
Snapshot every 100 operations, so recovery replays at most 99 operations on top of the latest snapshot.
👉 Interview Answer
I would persist both document snapshots and operation logs.
Snapshots speed up recovery, while operation logs support replay, undo, and version history.
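The snapshot-plus-log strategy can be sketched as follows. `DocumentStore`, the tuple-based operation format, and in-memory storage are assumptions for illustration; a real system would write both the snapshots and the log to durable storage.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[str, int, object]  # ("insert", pos, text) or ("delete", pos, length)


def apply_op(text: str, op: Op) -> str:
    kind, pos, arg = op
    if kind == "insert":
        return text[:pos] + arg + text[pos:]
    if kind == "delete":
        return text[:pos] + text[pos + arg:]
    raise ValueError(f"unknown operation: {kind!r}")


class DocumentStore:
    def __init__(self, snapshot_every: int = 100) -> None:
        self.snapshot_every = snapshot_every
        self.log: List[Op] = []                # log[i] produces revision i + 1
        self.snapshots: Dict[int, str] = {0: ""}
        self.text = ""

    def append(self, op: Op) -> int:
        """Apply and log one op, snapshotting every N revisions."""
        self.text = apply_op(self.text, op)
        self.log.append(op)
        revision = len(self.log)
        if revision % self.snapshot_every == 0:
            self.snapshots[revision] = self.text
        return revision

    def recover(self, revision: Optional[int] = None) -> str:
        """Rebuild any revision from the nearest earlier snapshot plus replay."""
        revision = len(self.log) if revision is None else revision
        base = max(r for r in self.snapshots if r <= revision)
        text = self.snapshots[base]
        for op in self.log[base:revision]:
            text = apply_op(text, op)
        return text
```

The same `recover` call serves crash recovery (latest revision) and version history (any earlier revision).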
1️⃣2️⃣ Offline Editing
Offline Flow
Client disconnected
→ Local edits queued
→ Reconnect
→ Sync pending operations
→ Resolve conflicts
Challenges
- Operation ordering
- Version divergence
- Large conflict windows
Why CRDT Helps
Because replicas merge automatically, regardless of the order in which they receive each other's operations.
👉 Interview Answer
Offline editing is difficult because clients may diverge significantly.
CRDTs simplify offline synchronization, while OT-based systems usually require stronger coordination.
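The offline flow can be sketched from the client side as a queue of unacknowledged operations that is resent on reconnect. `OfflineQueue` and its method names are hypothetical, and the sketch only covers queuing and resending; merging the resent operations with edits made by others still requires OT or a CRDT on top.

```python
from typing import Callable, List, Tuple

PendingOp = Tuple[int, dict]  # (client sequence number, operation)


class OfflineQueue:
    def __init__(self, send: Callable[[PendingOp], None]) -> None:
        self._send = send
        self.online = False
        self.pending: List[PendingOp] = []
        self._next_seq = 1

    def local_edit(self, op: dict) -> None:
        """Queue every local edit; transmit immediately only when connected."""
        entry = (self._next_seq, op)
        self._next_seq += 1
        self.pending.append(entry)
        if self.online:
            self._send(entry)

    def acknowledge(self, seq: int) -> None:
        """Drop everything the server has confirmed up to and including seq."""
        self.pending = [entry for entry in self.pending if entry[0] > seq]

    def disconnect(self) -> None:
        self.online = False

    def reconnect(self) -> None:
        """Resend all unacknowledged edits in their original order."""
        self.online = True
        for entry in self.pending:
            self._send(entry)
```

Resending may deliver an operation twice if an acknowledgment was lost, so this design assumes the server deduplicates by client sequence number.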
1️⃣3️⃣ Scaling Patterns
Pattern 1: Document Sharding
Partition by document ID.
Pattern 2: Sticky Sessions
Keep all users of a document on the same collaboration node.
Pattern 3: Pub/Sub Fanout
Broadcast operations efficiently.
Pattern 4: Snapshot Compaction
Fold old operations into snapshots to reduce replay overhead.
Pattern 5: Delta Sync
Only send incremental changes.
👉 Interview Answer
To scale collaborative editing, I would shard by document, use sticky sessions for active collaborators, and broadcast incremental operations through pub/sub.
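Document sharding and sticky routing can be sketched with a stable hash of the document ID. The function name and node names are assumptions; the sketch uses simple modulo hashing, which remaps most documents when the node count changes, so a consistent-hashing ring is a common alternative when nodes are added or removed often.

```python
import hashlib
from typing import Sequence


def node_for_document(document_id: str, nodes: Sequence[str]) -> str:
    """Map a document to one collaboration node, the same one every time."""
    if not nodes:
        raise ValueError("at least one collaboration node is required")
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[bucket]


nodes = ["collab-0", "collab-1", "collab-2"]  # hypothetical node names
assigned = node_for_document("doc_123", nodes)
```

A cryptographic hash is used here only because it is stable across processes and machines, unlike Python's built-in `hash` for strings.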
1️⃣4️⃣ Failure Handling
Common Failures
- WebSocket disconnect
- Duplicate operations
- Out-of-order delivery
- Collaboration server crash
- Lost presence updates
- Split-brain conflicts
Strategies
- Client retry
- Idempotent operations
- Sequence numbers
- Session recovery
- Heartbeats
- Operation replay
👉 Interview Answer
Real-time collaboration systems must tolerate disconnects and retries.
I would use sequence numbers, idempotent operations, and replay mechanisms to recover session state.
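A minimal sketch of how sequence numbers give idempotency and ordering on the receiving side. `OrderedInbox` is a hypothetical name, and sequence numbers are assumed to start at 1 and increase by one per operation.

```python
from typing import Any, Dict, List


class OrderedInbox:
    """Drops duplicate operations and releases the rest in sequence order."""

    def __init__(self) -> None:
        self.next_seq = 1
        self._buffer: Dict[int, Any] = {}

    def receive(self, seq: int, op: Any) -> List[Any]:
        """Return the operations that are now ready to apply, in order."""
        if seq < self.next_seq or seq in self._buffer:
            return []  # duplicate delivery: already applied or already buffered
        self._buffer[seq] = op
        ready = []
        while self.next_seq in self._buffer:  # release any contiguous run
            ready.append(self._buffer.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

A gap that never fills signals a lost operation, at which point the receiver can request a replay from the operation log starting at `next_seq`.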
1️⃣5️⃣ Consistency Model
Strong Consistency Needed For
- Final document convergence (replicas must never diverge permanently)
- Access control
- Permissions
- Critical saves
Eventual Consistency Acceptable For
- Cursor positions
- Presence indicators
- Typing status
- Temporary local divergence
Local-first UX
Fast local responsiveness
→ eventual synchronization
👉 Interview Answer
Collaborative editing systems prioritize responsiveness.
Temporary divergence is acceptable, as long as all clients eventually converge to the same document state.
1️⃣6️⃣ Observability
Key Metrics
- Edit propagation latency
- WebSocket connection count
- Operation throughput
- Conflict resolution rate
- Sync failure rate
- Reconnection frequency
- Snapshot recovery latency
- Presence update latency
Alerts
- High operation lag
- Failed synchronization spikes
- WebSocket disconnect surge
- Collaboration node overload
👉 Interview Answer
I would monitor edit latency, synchronization failures, reconnect frequency, operation throughput, and conflict resolution metrics.
These metrics show whether collaboration feels real time and reliable.
1️⃣7️⃣ End-to-End Flow
Real-time Editing Flow
User types
→ Local operation generated
→ WebSocket sends operation
→ Collaboration server sequences operation
→ OT/CRDT resolves conflicts
→ Broadcast updates
→ Clients apply operations
→ Snapshot and logs persisted
Offline Recovery Flow
Client reconnects
→ Sends pending operations
→ Server merges operations
→ Missing updates replayed
→ Client converges
Key Insight
Collaborative editing is fundamentally a distributed synchronization and conflict resolution system.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a collaborative editing system, I think of it as a real-time distributed synchronization platform.
Clients maintain persistent WebSocket connections to collaboration servers, which coordinate edits across users.
Users generate local edit operations, and those operations are propagated in real time to other collaborators.
The core challenge is concurrent editing.
Multiple users may modify overlapping regions of the document simultaneously, so the system needs a conflict resolution strategy such as Operational Transformation or CRDTs.
OT transforms operations relative to previously applied edits, while CRDTs guarantee convergence mathematically across distributed replicas.
The system should optimize for responsiveness, so clients usually apply edits optimistically before server acknowledgment.
Presence information such as cursors and typing indicators should be handled separately from durable document storage, because they are transient.
I would persist document snapshots along with operation logs.
Snapshots improve recovery speed, while logs support replay, undo/redo, audit history, and debugging.
For scalability, I would shard by document ID, use sticky collaboration sessions, and broadcast incremental operations through pub/sub systems.
The main trade-offs are consistency, responsiveness, metadata overhead, offline support, and implementation complexity.
Ultimately, collaborative editing is a distributed conflict resolution system that prioritizes low-latency user experience while guaranteeing eventual convergence.
⭐ Final Insight
The core of collaborative editing is not "multiple people writing a document at the same time" but a distributed synchronization system built from real-time synchronization, OT/CRDT conflict resolution, operation propagation, version history, and eventual convergence.
Condensed Version
🎯 Design Collaborative Editing
1️⃣ Core Framework
When designing collaborative editing, I usually analyze it along these dimensions:
- Real-time synchronization
- Concurrent editing conflict resolution
- OT or CRDT
- Presence and cursor updates
- Persistence and version history
- Offline editing
- Scalability and low latency
- Consistency vs responsiveness
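2️⃣ Core Requirements
Functional Requirements
- Multiple users edit a document simultaneously
- Real-time synchronization
- Show cursors and online status
- Persist document changes
- Support undo/redo
- Support offline editing
- Support version history
- Handle concurrent edits correctly
Non-functional Requirements
- Low edit latency
- High availability
- High responsiveness
- Eventual consistency is sufficient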
- Fault tolerance
- Horizontal scalability
- Durable storage
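👉 Interview Answer
Collaborative editing lets multiple users edit the same document simultaneously with real-time synchronization.
The core challenge is resolving concurrent edits while maintaining low latency and eventual consistency.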
3️⃣ Main APIs
Join Session
POST /api/document/join
Send Edit Operation
POST /api/document/op
Fetch History
GET /api/document/history
👉 Interview Answer
The system provides session join, operation submit, and history retrieval APIs.
Real-time edits are usually transmitted over WebSockets.
4️⃣ High-Level Architecture
Client Editors
↔ WebSocket Gateway
↔ Collaboration Service
↔ OT / CRDT Engine
↔ Document Storage
Main Components
WebSocket Gateway
- Persistent connections
- Push updates
- Reconnection handling
Collaboration Service
- Coordinates editing sessions
- Broadcasts operations
- Manages document state
OT / CRDT Engine
- Resolve conflicts
- Maintain convergence
Storage Layer
- Store snapshots
- Store operation logs
👉 Interview Answer
I would design collaborative editing as a real-time bidirectional sync system.
Clients exchange operations over WebSockets, and the OT or CRDT engine handles conflict resolution.
5️⃣ Real-time Synchronization
Why WebSockets?
Because collaboration requires:
- Low latency
- Bidirectional communication
- Persistent connection
Edit Flow
User types
→ Local operation generated
→ Send operation
→ Broadcast to collaborators
→ Clients apply operation
Optimistic UI
Clients apply edits locally first.
👉 Interview Answer
To keep the UI responsive, clients usually apply edits optimistically, and the server then coordinates the final order.
6️⃣ Concurrent Editing Problem
Example
User A:
Insert "Hello"
User B:
Delete nearby text
Both happen at the same time.
Challenges
- Position shifts
- Ordering ambiguity
- Network delay
Goals
- Convergence
- Intention preservation
- Consistency
👉 Interview Answer
Concurrent editing is the hardest part of collaborative editing.
The system must guarantee that all users eventually converge to the same document state.
7️⃣ Operational Transformation (OT)
Core Idea
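Transform new operations dynamically based on operations that have already been applied.
Advantages
- Mature
- Used by Google Docs
- Lower metadata overhead
Challenges
- Edge cases are complex
- Hard to implement
👉 Interview Answer
OT dynamically transforms concurrent operations to keep the document consistent.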
8️⃣ CRDT
Core Idea
Multiple replicas apply operations independently and still guarantee convergence.
Pros
- Better suited to offline editing
- Strong eventual consistency
Cons
- Higher metadata overhead
- Higher memory usage
👉 Interview Answer
CRDTs are a good fit for offline-first systems because replicas can merge changes independently.
9️⃣ OT vs CRDT
| Aspect | OT | CRDT |
|---|---|---|
| Offline support | Harder | Better |
| Metadata overhead | Lower | Higher |
| Memory usage | Lower | Higher |
| Central coordination | Usually yes | Not required |
👉 Interview Answer
Google Docs leans toward OT, while modern offline-first editors increasingly use CRDTs.
🔟 Presence and Cursor Tracking
Presence Features
- Online users
- Live cursors
- Typing indicators
Optimization
- Throttle cursor updates
- In-memory storage
👉 Interview Answer
Presence data is transient and does not need durable persistence.
1️⃣1️⃣ Persistence and History
What to Persist
- Document snapshot
- Operation logs
- User actions
Snapshot Strategy
Periodic snapshots + operation log
👉 Interview Answer
Snapshots speed up recovery, while operation logs support replay, undo/redo, and version history.
1️⃣2️⃣ Offline Editing
Flow
Offline edits
→ Local queue
→ Reconnect
→ Sync operations
Challenges
- Divergence
- Ordering
- Conflict resolution
👉 Interview Answer
Offline editing is hard because clients may diverge for a long time.
CRDTs are better suited to offline sync.
1️⃣3️⃣ Scaling Patterns
Common Strategies
- Document sharding
- Sticky sessions
- Pub/Sub fanout
- Delta synchronization
👉 Interview Answer
To scale, I would shard by document and use pub/sub to broadcast operations.
1️⃣4️⃣ Failure Handling
Common Failures
- WebSocket disconnect
- Duplicate operations
- Out-of-order delivery
Recovery
- Retry
- Sequence numbers
- Replay operations
👉 Interview Answer
The system must tolerate disconnects and retries.
Sequence numbers and replay mechanisms are essential.
1️⃣5️⃣ Consistency Model
Strong Consistency Needed
- Final document convergence
- Access control
Eventual Consistency Acceptable
- Cursor positions
- Presence
👉 Interview Answer
Collaborative editing prioritizes responsiveness.
Temporary divergence is acceptable as long as replicas eventually converge.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a collaborative editing system, I treat it as a real-time distributed synchronization platform.
Clients establish persistent WebSocket connections to collaboration servers.
Users generate local edit operations, and the system broadcasts them to other collaborators in real time.
The core difficulty is concurrent editing.
Multiple users may modify the same region at the same time, so the system needs OT or CRDTs for conflict resolution.
OT transforms operations, while CRDTs guarantee mathematically that replicas converge.
For responsiveness, clients usually apply edits optimistically.
Presence information such as cursors and typing indicators is transient state and should be kept separate from durable storage.
I would persist snapshots and operation logs.
Snapshots speed up recovery, while logs support replay, undo/redo, audit history, and debugging.
To scale, I would shard by document, use sticky collaboration sessions, and broadcast incremental operations through pub/sub.
The core trade-offs are consistency, responsiveness, metadata overhead, offline support, and implementation complexity.
Ultimately, collaborative editing is a distributed conflict resolution system whose goal is a low-latency real-time collaboration experience with guaranteed eventual convergence.
⭐ Final Insight
The core of collaborative editing is not "multiple people editing at the same time" but a distributed synchronization system built from real-time synchronization, OT/CRDT conflict resolution, operation propagation, version history, and eventual convergence.