🎯 Design File Storage (S3-like)
1️⃣ Core Framework
When discussing S3-like File Storage design, I frame it as:
- Core object storage model: bucket, object, key
- Upload, download, delete, and list flows
- Metadata and object data separation
- Partitioning and replication
- Durability and availability
- Large file handling and multipart upload
- Consistency, versioning, and lifecycle policies
- Security, access control, and cost control
2️⃣ Core Requirements
Functional Requirements
- Create buckets
- Upload objects
- Download objects
- Delete objects
- List objects by prefix
- Support object metadata
- Support large files
- Support multipart upload
- Support access control
- Support versioning
- Support lifecycle policies
Non-functional Requirements
- Very high durability
- High availability
- Scalable storage capacity
- High write throughput
- Efficient range reads
- Cost-effective long-term storage
- Strong security and access control
👉 Interview Answer
An S3-like storage system is an object storage system.
It stores data as objects inside buckets, where each object is identified by a key.
The main challenges are durability, availability, scalable metadata management, large file upload, replication, access control, and cost-efficient storage.
3️⃣ Core Concepts
Bucket
A bucket is a namespace for objects.
Example:
bucket = user-photos
Object
An object contains:
- Object data
- Object key
- Metadata
- Version ID
- Checksum
- Size
- Storage class
Object Key
Example:
photos/2026/05/image.jpg
Important:
- Object keys look like paths
- But object storage is not a traditional file system
- Prefixes are used for listing and organization
👉 Interview Answer
I would model the system around buckets and objects.
A bucket provides a namespace, and each object is identified by a key.
The key may look like a file path, but internally the system is usually a flat key-value object store, not a hierarchical file system.
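A minimal sketch of the flat-namespace idea, assuming a hypothetical in-memory dict as the store (the names `store`, `put`, and `list_level` are illustrative, not part of any real API). It shows that "folders" can be derived from key strings at list time rather than stored:

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory store: one flat dict keyed by (bucket, key).
store: Dict[Tuple[str, str], bytes] = {}


def put(bucket: str, key: str, data: bytes) -> None:
    store[(bucket, key)] = data


def list_level(bucket: str, prefix: str = "", delimiter: str = "/") -> Tuple[List[str], List[str]]:
    """Return (keys, common_prefixes) directly under `prefix`.

    "Folders" are computed from the key strings; nothing is stored for them.
    """
    keys, folders = [], set()
    for b, key in sorted(store):
        if b != bucket or not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Collapse everything below the next delimiter into one pseudo-folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(folders)


put("user-photos", "photos/2026/05/image.jpg", b"...")
put("user-photos", "photos/2026/06/other.jpg", b"...")
put("user-photos", "readme.txt", b"...")
```

Listing the bucket root yields one key (`readme.txt`) and one computed prefix (`photos/`), even though no directory was ever created.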
4️⃣ Main APIs
Create Bucket
PUT /buckets/{bucketName}
Upload Object
PUT /buckets/{bucketName}/objects/{objectKey}
Request body:
binary file content
Headers:
Content-Type: image/jpeg
Content-Length: 1048576
x-checksum-sha256: abc123
Download Object
GET /buckets/{bucketName}/objects/{objectKey}
Delete Object
DELETE /buckets/{bucketName}/objects/{objectKey}
List Objects
GET /buckets/{bucketName}/objects?prefix=photos/2026/&limit=1000&cursor=xxx
Multipart Upload
POST /buckets/{bucketName}/objects/{objectKey}/multipart
PUT /multipart/{uploadId}/part/{partNumber}
POST /multipart/{uploadId}/complete
👉 Interview Answer
The core APIs are create bucket, upload object, download object, delete object, and list objects by prefix.
For large files, I would support multipart upload, where the client uploads file parts independently and then completes the upload once all parts are stored.
5️⃣ Data Model
Bucket Metadata Table
bucket (
bucket_id VARCHAR PRIMARY KEY,
bucket_name VARCHAR UNIQUE,
owner_id VARCHAR,
region VARCHAR,
created_at TIMESTAMP,
versioning_enabled BOOLEAN,
lifecycle_policy JSON
)
Object Metadata Table
object_metadata (
bucket_id VARCHAR,
object_key VARCHAR,
version_id VARCHAR,
object_id VARCHAR UNIQUE,
size BIGINT,
content_type VARCHAR,
checksum VARCHAR,
storage_class VARCHAR,
status VARCHAR,
created_at TIMESTAMP,
updated_at TIMESTAMP,
physical_location JSON,
user_metadata JSON,
PRIMARY KEY (bucket_id, object_key, version_id)
)
Object Chunk Table
object_chunk (
object_id VARCHAR,
chunk_index INT,
chunk_id VARCHAR,
size BIGINT,
checksum VARCHAR,
storage_nodes ARRAY,
PRIMARY KEY (object_id, chunk_index)
)
Multipart Upload Table
multipart_upload (
upload_id VARCHAR PRIMARY KEY,
bucket_id VARCHAR,
object_key VARCHAR,
owner_id VARCHAR,
status VARCHAR,
created_at TIMESTAMP
)
👉 Interview Answer
I would separate object metadata from object data.
Metadata stores the object key, size, checksum, version, storage class, and physical location, plus references to access policies.
The actual object bytes are stored in distributed storage nodes, often split into chunks for large objects.
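As a rough illustration of how rows for the object chunk table could be produced, here is a sketch that splits a byte string into fixed-size chunks and checksums each one. The `ChunkRecord` class, the `chunk_id` naming scheme, and the use of SHA-256 are assumptions for the example:

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkRecord:
    object_id: str
    chunk_index: int
    chunk_id: str
    size: int
    checksum: str  # hex SHA-256 of the chunk bytes


def split_into_chunks(object_id: str, data: bytes, chunk_size: int) -> List[ChunkRecord]:
    """Build one object_chunk row per fixed-size slice of `data`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    records = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        piece = data[offset:offset + chunk_size]
        records.append(ChunkRecord(
            object_id=object_id,
            chunk_index=index,
            chunk_id=f"{object_id}-{index}",  # hypothetical ID scheme
            size=len(piece),
            checksum=hashlib.sha256(piece).hexdigest(),
        ))
    return records
```

Only the last chunk may be shorter than `chunk_size`, which is what makes offset arithmetic for range reads straightforward.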
6️⃣ High-Level Architecture
Client
→ API Gateway
→ Auth Service
→ Metadata Service
→ Storage Coordinator
→ Storage Nodes
→ Replication Service
→ Background Repair / Lifecycle Workers
Main Components
API Gateway
- Request routing
- Authentication
- Rate limiting
- TLS termination
Metadata Service
- Bucket metadata
- Object metadata
- Version metadata
- Prefix listing
- Access policy references
Storage Coordinator
- Chooses storage nodes
- Splits large objects into chunks
- Coordinates replication
- Verifies checksums
Storage Nodes
- Store object chunks
- Serve reads
- Report health
Background Workers
- Replication repair
- Garbage collection
- Lifecycle transitions
- Expired object deletion
👉 Interview Answer
I would separate metadata services from storage nodes.
Metadata service handles bucket and object metadata, while storage nodes store the actual bytes.
A storage coordinator manages chunk placement, replication, checksum verification, and object assembly during reads.
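One way the coordinator's placement step might look, as a sketch only: pick healthy nodes with enough free space, at most one per availability zone, preferring the emptiest nodes. The `Node` fields and the `choose_nodes` policy are assumptions; a real coordinator would also weigh load and rack topology.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    node_id: str
    zone: str
    free_bytes: int
    healthy: bool = True


def choose_nodes(nodes: List[Node], replicas: int, chunk_size: int) -> List[Node]:
    """Pick `replicas` healthy nodes, at most one per zone, preferring free space."""
    candidates = sorted(
        (n for n in nodes if n.healthy and n.free_bytes >= chunk_size),
        key=lambda n: n.free_bytes,
        reverse=True,
    )
    chosen, used_zones = [], set()
    for node in candidates:
        if node.zone not in used_zones:
            chosen.append(node)
            used_zones.add(node.zone)
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("not enough healthy zones for the requested replication factor")
```

Failing loudly when too few zones are available is a deliberate choice here: silently placing two replicas in one zone would weaken the durability guarantee without anyone noticing.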
7️⃣ Upload Flow
Basic Upload Flow
Client uploads object
→ API Gateway authenticates request
→ Metadata Service validates bucket and permissions
→ Storage Coordinator selects storage nodes
→ Object data written to storage nodes
→ Checksums verified
→ Object metadata committed
→ Response returned to client
Important Design Choice
Commit metadata only after data is safely stored.
Why?
- Prevent metadata pointing to missing data
- Ensure object is readable after successful upload
- Improve durability semantics
👉 Interview Answer
During upload, I would first authenticate the request and validate bucket permissions.
Then the storage coordinator writes object data to storage nodes and verifies checksums.
Only after the data is durably stored would the metadata service commit the object metadata.
This prevents successful metadata writes from pointing to missing data.
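The ordering rule can be sketched with in-memory stand-ins for the storage nodes and the metadata service. Everything here (`storage_nodes`, `metadata`, the `blob_id` scheme) is hypothetical; the point is only that the metadata write is the last step:

```python
import hashlib
from typing import Dict, List

# Hypothetical in-memory stand-ins for three storage nodes and the metadata service.
storage_nodes: List[Dict[str, bytes]] = [{}, {}, {}]
metadata: Dict[str, dict] = {}


def upload(key: str, data: bytes, expected_sha256: str) -> None:
    """Write to every replica, verify, and only then commit metadata."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch: data corrupted in transit")

    blob_id = f"blob-{expected_sha256[:16]}"
    for node in storage_nodes:
        node[blob_id] = data
        # Read back and verify what the node actually stored.
        if hashlib.sha256(node[blob_id]).hexdigest() != expected_sha256:
            raise IOError("replica failed verification")

    # Commit point: the object becomes visible only here.
    metadata[key] = {"blob_id": blob_id, "size": len(data), "checksum": expected_sha256}


def download(key: str) -> bytes:
    record = metadata[key]  # KeyError means the object was never committed
    return storage_nodes[0][record["blob_id"]]
```

If a write fails partway, some nodes may hold a blob that no metadata points to. That is safe for readers, and the orphaned bytes can be reclaimed later by garbage collection.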
8️⃣ Download Flow
Basic Download Flow
Client requests object
→ API Gateway authenticates request
→ Metadata Service fetches object metadata
→ Storage Coordinator locates chunks
→ Storage nodes return data
→ Data streamed back to client
Range Read
Support:
Range: bytes=0-1048575
Used for:
- Video streaming
- Resume download
- Large file access
- Partial reads
👉 Interview Answer
For downloads, the system first checks permissions and fetches object metadata.
Then it locates the physical chunks and streams data from storage nodes back to the client.
I would also support range reads, which are important for large files, media streaming, and resumable downloads.
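To serve a range read from chunked storage, the requested byte range has to be mapped onto chunk reads. A small sketch, assuming fixed-size chunks and an inclusive range as in the HTTP `Range` header (the function name is illustrative):

```python
from typing import List, Tuple


def chunks_for_range(start: int, end: int, chunk_size: int) -> List[Tuple[int, int, int]]:
    """Map an inclusive byte range to (chunk_index, offset_in_chunk, length) reads."""
    if start < 0 or end < start or chunk_size <= 0:
        raise ValueError("invalid range")
    reads = []
    position = start
    while position <= end:
        index = position // chunk_size
        offset = position % chunk_size
        # Read to the end of this chunk or the end of the range, whichever is first.
        length = min(chunk_size - offset, end - position + 1)
        reads.append((index, offset, length))
        position += length
    return reads
```

With 4-byte chunks, `chunks_for_range(3, 9, 4)` touches three chunks: the last byte of chunk 0, all of chunk 1, and the first two bytes of chunk 2.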
9️⃣ Multipart Upload
Why Needed?
Large files are hard to upload in one request.
Problems:
- Network failure
- Timeout
- Retry cost
- Poor parallelism
Multipart Flow
Initiate multipart upload
→ Upload parts independently
→ Store each part with checksum
→ Complete multipart upload
→ Assemble metadata
→ Make object visible
Benefits
- Parallel upload
- Resume failed parts
- Better throughput
- Lower retry cost
Completion Rule
The object should not become visible until the upload is completed.
👉 Interview Answer
Multipart upload allows clients to split a large file into parts and upload those parts independently.
This improves throughput, supports retries for individual parts, and avoids restarting the entire upload after a failure.
The object only becomes visible after the complete operation succeeds.
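The multipart flow and its visibility rule can be sketched with two in-memory dicts. The function names and the idea of passing part checksums to `complete` are assumptions for illustration, not a specific service's API:

```python
import hashlib
import uuid
from typing import Dict

uploads: Dict[str, Dict[int, bytes]] = {}   # upload_id -> {part_number: bytes}
visible_objects: Dict[str, bytes] = {}      # key -> assembled object


def initiate() -> str:
    upload_id = uuid.uuid4().hex
    uploads[upload_id] = {}
    return upload_id


def upload_part(upload_id: str, part_number: int, data: bytes) -> str:
    """Store one part; re-sending the same part number overwrites it (safe retry)."""
    uploads[upload_id][part_number] = data
    return hashlib.sha256(data).hexdigest()


def complete(upload_id: str, key: str, part_checksums: Dict[int, str]) -> None:
    """Verify every listed part, then publish the object in one step."""
    parts = uploads[upload_id]
    for number, checksum in part_checksums.items():
        if number not in parts or hashlib.sha256(parts[number]).hexdigest() != checksum:
            raise ValueError(f"part {number} missing or corrupted")
    # Parts may arrive in any order; assembly follows part numbers.
    visible_objects[key] = b"".join(parts[n] for n in sorted(part_checksums))
    del uploads[upload_id]
```

Parts can be uploaded out of order or in parallel, and nothing appears under the object key until `complete` has verified every part.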
🔟 Partitioning and Placement
Metadata Partitioning
Partition metadata by:
bucket_id + object_key hash
or:
bucket_id + prefix
Object Data Placement
Storage coordinator chooses nodes based on:
- Available capacity
- Node health
- Rack / availability zone
- Region
- Replication factor
- Load balancing
Avoiding Hot Prefixes
If many objects share the same prefix:
logs/2026/05/02/...
a prefix-based partition may become hot.
Strategies:
- Hash object key
- Use virtual partitions
- Split hot partitions
- Add a random or hashed key prefix for high-write workloads (at the cost of ordered prefix listing)
👉 Interview Answer
Metadata partitioning is critical because listing and object lookup both depend on metadata.
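I would partition object metadata by bucket and a hash of the object key for even load. Because hashing scatters keys that share a prefix, prefix listing then needs a separate sorted index or a scatter-gather across partitions; range-partitioning on the sorted key with hot-partition splitting is the alternative.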
For object data placement, I would distribute replicas across different nodes, racks, or availability zones to improve durability.
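The hot-prefix problem is easy to see in a small experiment. This sketch compares two hypothetical partition functions on 1000 keys that share the same date prefix; `NUM_PARTITIONS` and both function names are assumed values for illustration:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 16  # assumed value for illustration


def hash_partition(bucket_id: str, object_key: str) -> int:
    """Stable partition from a hash of bucket + key (not Python's salted hash())."""
    digest = hashlib.sha256(f"{bucket_id}/{object_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def prefix_partition(bucket_id: str, object_key: str, depth: int = 2) -> int:
    """Partition by the first `depth` path segments; shared prefixes collide."""
    prefix = "/".join(object_key.split("/")[:depth])
    digest = hashlib.sha256(f"{bucket_id}/{prefix}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


keys = [f"logs/2026/05/02/event-{i}.json" for i in range(1000)]
by_hash = Counter(hash_partition("b1", k) for k in keys)
by_prefix = Counter(prefix_partition("b1", k) for k in keys)
```

`by_prefix` ends up with a single partition holding all 1000 keys, while `by_hash` spreads them across partitions. The price of the hash scheme is that keys under one prefix are no longer stored together.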
1️⃣1️⃣ Replication and Durability
Replication
Example:
replication factor = 3
Each object chunk is stored on multiple storage nodes.
Placement Rule
Replicas should be placed across:
- Different disks
- Different nodes
- Different racks
- Different availability zones
Alternative: Erasure Coding
Instead of full replication:
split data into k data blocks + m parity blocks
Example:
10 data blocks + 4 parity blocks
With a Reed-Solomon-style code, 10 + 4 survives the loss of any 4 of the 14 blocks at 1.4x raw storage, compared with 3x for three-way replication.
Replication vs Erasure Coding
| Strategy | Pros | Cons |
|---|---|---|
| Replication | Simple, fast reads | Higher storage cost |
| Erasure coding | Lower storage cost | More complex, slower recovery |
👉 Interview Answer
For durability, I would store multiple copies of object chunks across different failure domains.
Replication is simpler and provides fast recovery, but it costs more storage.
For large cold objects, erasure coding can reduce storage cost while still providing high durability.
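The cost comparison and the basic recovery idea can be sketched in a few lines. The overhead formula is general; the XOR code below is only the simplest special case (a single parity block, m = 1), not the Reed-Solomon-style codes a production system would likely use for m > 1. All function names are illustrative:

```python
from typing import List


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte for k data + m parity blocks."""
    return (k + m) / k


def xor_parity(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together (the m = 1 special case)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost data block from the survivors and the parity."""
    return xor_parity(surviving + [parity])
```

Three-way replication can be read as k = 1, m = 2, which gives an overhead of 3.0, while 10 + 4 gives 1.4. Recovery under erasure coding has to read several surviving blocks, which is one reason repair is slower than copying a replica.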
1️⃣2️⃣ Consistency Model
Stronger Consistency Needed For
- Read-after-write for newly uploaded objects
- Delete correctness
- Permission checks
- Bucket metadata
- Object version metadata
Eventual Consistency Acceptable For
- Cross-region replication
- Lifecycle transitions
- Storage class changes
- Analytics and inventory reports
Versioning
If enabled:
same object_key can have multiple version_id values
Benefits:
- Recover deleted objects
- Protect against accidental overwrite
- Support auditability
👉 Interview Answer
I would aim for read-after-write consistency for newly uploaded objects in the same region.
Object metadata needs careful consistency, because clients expect an uploaded object to be readable after success.
Cross-region replication and lifecycle transitions can be eventually consistent.
Versioning can help protect against accidental deletes or overwrites.
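A sketch of how versioning protects against deletes and overwrites, assuming an append-only history per key and a delete marker represented as `None`. The function names and the `v1`, `v2` ID scheme are hypothetical:

```python
import itertools
from typing import Dict, List, Optional, Tuple

_counter = itertools.count(1)
# key -> list of (version_id, data); data is None for a delete marker.
versions: Dict[str, List[Tuple[str, Optional[bytes]]]] = {}


def put_version(key: str, data: bytes) -> str:
    version_id = f"v{next(_counter)}"
    versions.setdefault(key, []).append((version_id, data))
    return version_id


def delete_object(key: str) -> str:
    """Append a delete marker instead of erasing history."""
    version_id = f"v{next(_counter)}"
    versions.setdefault(key, []).append((version_id, None))
    return version_id


def get_object(key: str, version_id: Optional[str] = None) -> Optional[bytes]:
    """Return the latest version, or a specific one if `version_id` is given."""
    history = versions.get(key, [])
    if version_id is None:
        return history[-1][1] if history else None
    for vid, data in history:
        if vid == version_id:
            return data
    return None
```

After a delete, a plain read returns nothing, but any earlier version is still retrievable by its ID, which is what makes recovery possible.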
1️⃣3️⃣ Listing Objects
API
GET /buckets/photos/objects?prefix=2026/&limit=1000&cursor=xxx
Challenges
- Buckets may contain billions of objects
- Prefix listing can become expensive
- Results need pagination
- Newly written objects may appear with slight delay depending on consistency model
Strategies
- Store object metadata sorted by bucket + key
- Use cursor-based pagination
- Maintain prefix indexes
- Avoid scanning full bucket
👉 Interview Answer
Listing objects is a metadata query, not a storage-node data query.
Since buckets can contain billions of objects, listing must be paginated and should use metadata indexes sorted by bucket and object key.
Prefix listing should avoid scanning the entire bucket.
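Prefix listing over a sorted index might look like the following sketch. It assumes the keys of one bucket are available as a sorted list and uses the last returned key as the cursor; a real service would more likely hand out an opaque token. `sorted_keys` and `list_objects` are illustrative names:

```python
import bisect
from typing import List, Optional, Tuple

# Assumed: the metadata index for one bucket, kept sorted by key.
sorted_keys: List[str] = sorted([
    "2025/a.jpg", "2026/a.jpg", "2026/b.jpg", "2026/c.jpg", "2027/a.jpg",
])


def list_objects(prefix: str, limit: int, cursor: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
    """Return up to `limit` keys with `prefix`, plus a cursor for the next page."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Seek directly to the first candidate instead of scanning the bucket.
    if cursor is None:
        start = bisect.bisect_left(sorted_keys, prefix)
    else:
        start = bisect.bisect_right(sorted_keys, cursor)
    page: List[str] = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match the prefix
        if len(page) == limit:
            return page, page[-1]  # more results remain
        page.append(key)
    return page, None
```

Because matching keys are contiguous in sorted order, each page costs one seek plus a scan of at most `limit + 1` entries, regardless of bucket size.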
1️⃣4️⃣ Security and Access Control
Access Control Options
- Bucket policy
- Object ACL
- IAM-style permissions
- Pre-signed URLs
- Temporary credentials
Pre-signed URL
Allows temporary access to one operation on one object:
GET or PUT on the object allowed until expiration time
Useful for:
- Direct browser upload
- Temporary file sharing
- Reducing application server load
Encryption
Support:
- TLS in transit
- Server-side encryption
- Client-side encryption
- Key management service integration
👉 Interview Answer
Security is critical for object storage.
I would support bucket policies, IAM-style permissions, object-level access control, and pre-signed URLs for temporary access.
Data should be encrypted in transit and at rest, with integration to a key management service.
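The pre-signed URL idea can be sketched with an HMAC over the method, path, and expiry. This is a simplified scheme for illustration, not the signing algorithm of any real provider; `SECRET_KEY`, the query parameter names, and the message layout are all assumptions:

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # assumption: a server-side signing key


def _sign(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(method: str, path: str, expires: int) -> str:
    """Return a path + query valid for `method` until the `expires` Unix time."""
    query = urlencode({"expires": expires, "signature": _sign(method, path, expires)})
    return f"{path}?{query}"


def verify(method: str, url: str, now: int) -> bool:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = _sign(method, parsed.path, expires)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature, expected)
```

Because the method and path are part of the signed message, a URL issued for GET on one object cannot be reused for PUT or for a different key.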
1️⃣5️⃣ Lifecycle and Storage Classes
Storage Classes
Examples:
standard
infrequent_access
archive
deep_archive
Lifecycle Rules
Examples:
Move objects older than 30 days to infrequent access
Move objects older than 180 days to archive
Delete temporary files after 7 days
Background Workers
Lifecycle workers:
- Scan metadata
- Find eligible objects
- Move objects between storage classes
- Delete expired objects
- Clean incomplete multipart uploads
👉 Interview Answer
To control cost, I would support lifecycle policies and storage classes.
Frequently accessed objects stay in standard storage, while older or rarely accessed objects can move to cheaper archive storage.
Background workers enforce lifecycle rules asynchronously.
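The example rules above could be evaluated per object roughly as follows. The `ObjectInfo` fields, the `tmp/` prefix used to identify temporary files, and the action strings are assumptions made for this sketch:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ObjectInfo:
    key: str
    created_at: datetime
    storage_class: str = "standard"


def lifecycle_action(obj: ObjectInfo, now: datetime) -> Optional[str]:
    """Return the action a worker should take, or None if nothing is due."""
    age = now - obj.created_at
    # Assumption: temporary files are identified by a "tmp/" key prefix.
    if obj.key.startswith("tmp/") and age > timedelta(days=7):
        return "delete"
    if age > timedelta(days=180) and obj.storage_class in ("standard", "infrequent_access"):
        return "move:archive"
    if age > timedelta(days=30) and obj.storage_class == "standard":
        return "move:infrequent_access"
    return None
```

Checking the current storage class keeps the worker idempotent: an object that is already archived is not moved again on the next scan.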
1️⃣6️⃣ CDN and Edge Caching
Why CDN?
Object storage often serves static content.
Examples:
- Images
- Videos
- Downloads
- Public assets
Flow
Client
→ CDN Edge
→ Object Storage Origin
Benefits
- Lower latency
- Less origin load
- Better global performance
- Reduced bandwidth cost
👉 Interview Answer
For public or frequently accessed objects, I would integrate object storage with a CDN.
CDN caches objects close to users, reducing download latency and origin load.
Private content may need signed URLs at the edge, and frequently updated content may need cache invalidation or short TTLs.
1️⃣7️⃣ Failure Handling
Common Failures
- Storage node failure
- Disk failure
- Metadata service unavailable
- Partial upload failure
- Replication lag
- Checksum mismatch
- Hot partition
- Region outage
Strategies
- Replication across failure domains
- Erasure coding
- Checksum validation
- Background repair
- Retry failed parts
- Quorum writes for metadata
- Failover to replicas
- Cross-region replication for disaster recovery
👉 Interview Answer
Object storage must assume disks and nodes will fail.
I would use replication or erasure coding, checksum validation, background repair, and placement across failure domains.
For large uploads, multipart upload allows failed parts to be retried independently.
For disaster recovery, cross-region replication can be used.
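Background repair can be sketched as a scan that finds under-replicated chunks and plans new copies. The input shapes (`chunk_locations`, `healthy_nodes`) and `REPLICATION_FACTOR` are assumptions for illustration, and target selection here ignores failure domains for brevity:

```python
from typing import Dict, List, Set

REPLICATION_FACTOR = 3  # assumed value, matching the example factor used earlier


def plan_repairs(
    chunk_locations: Dict[str, Set[str]],
    healthy_nodes: Set[str],
) -> Dict[str, List[str]]:
    """For each under-replicated chunk, pick new target nodes to restore the factor."""
    plan = {}
    for chunk_id, nodes in chunk_locations.items():
        live = nodes & healthy_nodes
        missing = REPLICATION_FACTOR - len(live)
        if missing <= 0:
            continue
        if not live:
            # No source left to copy from: this is data loss, not a repair task.
            raise RuntimeError(f"chunk {chunk_id} has no surviving replica")
        candidates = sorted(healthy_nodes - live)
        plan[chunk_id] = candidates[:missing]
    return plan
```

The sooner this scan runs after a node failure, the shorter the window in which a second failure could take out the remaining replicas.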
1️⃣8️⃣ Observability
Key Metrics
- Upload latency
- Download latency
- Error rate
- Metadata query latency
- Storage node health
- Replication lag
- Disk utilization
- Object count
- Bucket size
- Failed checksum count
- Lifecycle job lag
- Hot partition count
Important Dashboards
- Storage capacity
- API latency
- Metadata service health
- Replication health
- Node failures
- Lifecycle progress
- Cost by bucket / tenant
👉 Interview Answer
Observability is essential for object storage.
I would monitor upload and download latency, storage node health, replication lag, disk utilization, metadata query latency, checksum failures, and lifecycle job progress.
These metrics help detect durability risks, performance issues, and cost growth.
1️⃣9️⃣ End-to-End Flow
Upload Flow
Client uploads object
→ Authenticate request
→ Validate bucket and permissions
→ Storage coordinator chooses nodes
→ Write object chunks
→ Verify checksum
→ Replicate chunks
→ Commit object metadata
→ Return success
Download Flow
Client requests object
→ Authenticate request
→ Fetch object metadata
→ Locate chunks
→ Read from storage nodes
→ Stream object back to client
Multipart Upload Flow
Initiate upload
→ Upload parts in parallel
→ Store each part
→ Complete upload
→ Commit final object metadata
→ Make object visible
Lifecycle Flow
Lifecycle worker scans metadata
→ Finds eligible objects
→ Moves object to cheaper storage class
→ Updates metadata
→ Deletes expired objects
Key Insight
S3-like storage is not a file system — it is a highly durable, distributed object store.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing an S3-like file storage system, I think of it as a distributed object storage system.
The core abstraction is bucket, object, and key. A bucket is a namespace, and an object contains data, metadata, checksum, version, and storage class.
I would separate metadata from object data. Metadata is stored in a strongly consistent metadata service, while object bytes are stored across distributed storage nodes.
During upload, the system authenticates the request, validates bucket permissions, writes object chunks to storage nodes, verifies checksums, replicates the data, and only then commits metadata.
For downloads, the system reads metadata, locates object chunks, and streams data from storage nodes.
For large files, I would support multipart upload, allowing clients to upload parts independently and retry failed parts without restarting the whole upload.
For durability, object chunks should be replicated across failure domains, or stored using erasure coding for cost-efficient durability.
Metadata should support efficient object lookup and prefix listing, usually with pagination and cursor-based listing.
For security, I would support IAM-style permissions, bucket policies, object ACLs, pre-signed URLs, encryption at rest and in transit, and integration with a key management service.
Lifecycle policies and storage classes help control cost by moving old or rarely accessed objects to cheaper storage.
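The main trade-offs involve durability, availability, consistency, storage cost, latency, and metadata scalability: for example, more replicas raise both durability and cost, and stronger consistency tends to raise latency.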
Ultimately, the goal is to store massive amounts of object data durably and cost-effectively, while providing scalable access, strong security, and predictable performance.
⭐ Final Insight
The core of S3-like File Storage is not a traditional file system, but a highly durable distributed object storage system built on the bucket and object abstractions.