
🎯 Design File Storage (S3-like)

1️⃣ Core Framework

When discussing S3-like File Storage design, I frame it as:

Core object storage model: bucket, object, key
Upload, download, delete, and list flows
Metadata and object data separation
Partitioning and replication
Durability and availability
Large file handling and multipart upload
Consistency, versioning, and lifecycle policies
Security, access control, and cost control

2️⃣ Core Requirements

Functional Requirements

Create buckets
Upload objects
Download objects
Delete objects
List objects by prefix
Support object metadata
Support large files
Support multipart upload
Support access control
Support versioning
Support lifecycle policies

Non-functional Requirements

Very high durability
High availability
Scalable storage capacity
High write throughput
Efficient range reads
Cost-effective long-term storage
Strong security and access control

👉 Interview Answer

An S3-like storage system is an object storage system.

It stores data as objects inside buckets, where each object is identified by a key.

The main challenges are durability, availability, scalable metadata management, large file upload, replication, access control, and cost-efficient storage.

3️⃣ Core Concepts

Bucket

A bucket is a namespace for objects.

Example:

bucket = user-photos

Object

An object contains:

Object data
Object key
Metadata
Version ID
Checksum
Size
Storage class

Object Key

Example:

photos/2026/05/image.jpg

Important:

Object keys look like paths
But object storage is not a traditional file system
Prefixes are used for listing and organization

👉 Interview Answer

I would model the system around buckets and objects.

A bucket provides a namespace, and each object is identified by a key.

The key may look like a file path, but internally the system is usually a flat key-value object store, not a hierarchical file system.

4️⃣ Main APIs

Create Bucket

PUT /buckets/{bucketName}

Upload Object

PUT /buckets/{bucketName}/objects/{objectKey}

Request body:

binary file content

Headers:

Content-Type: image/jpeg
Content-Length: 1048576
x-checksum-sha256: abc123

Download Object

GET /buckets/{bucketName}/objects/{objectKey}

Delete Object

DELETE /buckets/{bucketName}/objects/{objectKey}

List Objects

GET /buckets/{bucketName}/objects?prefix=photos/2026/&limit=1000&cursor=xxx

Multipart Upload

POST /buckets/{bucketName}/objects/{objectKey}/multipart
PUT /multipart/{uploadId}/part/{partNumber}
POST /multipart/{uploadId}/complete

👉 Interview Answer

The core APIs are create bucket, upload object, download object, delete object, and list objects by prefix.

For large files, I would support multipart upload, where the client uploads file parts independently and then completes the upload once all parts are stored.

5️⃣ Data Model

Bucket Metadata Table

bucket (
  bucket_id VARCHAR PRIMARY KEY,
  bucket_name VARCHAR UNIQUE,
  owner_id VARCHAR,
  region VARCHAR,
  created_at TIMESTAMP,
  versioning_enabled BOOLEAN,
  lifecycle_policy JSON
)

Object Metadata Table

object_metadata (
  bucket_id VARCHAR,
  object_key VARCHAR,
  version_id VARCHAR,
  object_id VARCHAR UNIQUE,  -- referenced by object_chunk.object_id
  size BIGINT,
  content_type VARCHAR,
  checksum VARCHAR,
  storage_class VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  physical_location JSON,
  user_metadata JSON,
  PRIMARY KEY (bucket_id, object_key, version_id)
)

Object Chunk Table

object_chunk (
  object_id VARCHAR,
  chunk_index INT,
  chunk_id VARCHAR,
  size BIGINT,
  checksum VARCHAR,
  storage_nodes ARRAY,
  PRIMARY KEY (object_id, chunk_index)
)

Multipart Upload Table

multipart_upload (
  upload_id VARCHAR PRIMARY KEY,
  bucket_id VARCHAR,
  object_key VARCHAR,
  owner_id VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP
)

👉 Interview Answer

I would separate object metadata from object data.

Metadata stores the object key, size, checksum, version, access policy references, and physical location.

The actual object bytes are stored in distributed storage nodes, often split into chunks for large objects.

6️⃣ High-Level Architecture

Client
→ API Gateway
→ Auth Service
→ Metadata Service
→ Storage Coordinator
→ Storage Nodes
→ Replication Service
→ Background Repair / Lifecycle Workers

Main Components

API Gateway

Request routing
Authentication
Rate limiting
TLS termination

Metadata Service

Bucket metadata
Object metadata
Version metadata
Prefix listing
Access policy references

Storage Coordinator

Chooses storage nodes
Splits large objects into chunks
Coordinates replication
Verifies checksums

Storage Nodes

Store object chunks
Serve reads
Report health

Background Workers

Replication repair
Garbage collection
Lifecycle transitions
Expired object deletion

👉 Interview Answer

I would separate metadata services from storage nodes.

Metadata service handles bucket and object metadata, while storage nodes store the actual bytes.

A storage coordinator manages chunk placement, replication, checksum verification, and object assembly during reads.

7️⃣ Upload Flow

Basic Upload Flow

Client uploads object
→ API Gateway authenticates request
→ Metadata Service validates bucket and permissions
→ Storage Coordinator selects storage nodes
→ Object data written to storage nodes
→ Checksums verified
→ Object metadata committed
→ Response returned to client

Important Design Choice

Commit metadata only after data is safely stored.

Why?

Prevent metadata pointing to missing data
Ensure object is readable after successful upload
Improve durability semantics

👉 Interview Answer

During upload, I would first authenticate the request and validate bucket permissions.

Then the storage coordinator writes object data to storage nodes and verifies checksums.

Only after the data is durably stored would the metadata service commit the object metadata.

This prevents successful metadata writes from pointing to missing data.
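
The ordering above can be sketched in a few lines. This is a minimal in-memory illustration, not a real storage backend: `StorageNode`, `MetadataStore`, and `upload_object` are hypothetical names, and replication is simulated by writing the same bytes to every node in a list.

```python
import hashlib


class StorageNode:
    """Hypothetical in-memory stand-in for a storage node."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def write(self, blob_id, data):
        self.blobs[blob_id] = data
        # Return the checksum of what was actually stored.
        return hashlib.sha256(self.blobs[blob_id]).hexdigest()


class MetadataStore:
    """Hypothetical metadata service keyed by (bucket, key)."""

    def __init__(self):
        self.objects = {}

    def commit(self, bucket, key, record):
        self.objects[(bucket, key)] = record


def upload_object(bucket, key, data, expected_sha256, nodes, metadata):
    blob_id = f"{bucket}/{key}"
    # 1. Write the bytes to every replica and verify each stored copy.
    for node in nodes:
        if node.write(blob_id, data) != expected_sha256:
            raise ValueError(f"checksum mismatch on {node.name}")
    # 2. Only now make the object visible by committing metadata.
    metadata.commit(bucket, key, {
        "size": len(data),
        "checksum": expected_sha256,
        "physical_location": [node.name for node in nodes],
    })
```

If the checksum check fails, the function raises before the commit, so no metadata record ever points at bad data. Orphaned bytes left on the nodes would be reclaimed by garbage collection.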

8️⃣ Download Flow

Basic Download Flow

Client requests object
→ API Gateway authenticates request
→ Metadata Service fetches object metadata
→ Storage Coordinator locates chunks
→ Storage nodes return data
→ Data streamed back to client

Range Read

Support:

Range: bytes=0-1048575

Used for:

Video streaming
Resume download
Large file access
Partial reads

👉 Interview Answer

For downloads, the system first checks permissions and fetches object metadata.

Then it locates the physical chunks and streams data from storage nodes back to the client.

I would also support range reads, which are important for large files, media streaming, and resumable downloads.
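
To make the range read concrete, here is a small sketch of how a coordinator might translate a byte range into per-chunk reads. It assumes fixed-size chunks, which the notes above do not specify, and `chunks_for_range` is a hypothetical helper name.

```python
def chunks_for_range(start, end, chunk_size, object_size):
    """Map an inclusive byte range to (chunk_index, offset, length) reads."""
    if start < 0 or start > end or start >= object_size:
        raise ValueError("unsatisfiable range")
    end = min(end, object_size - 1)  # clamp to the last byte of the object
    reads = []
    for index in range(start // chunk_size, end // chunk_size + 1):
        chunk_start = index * chunk_size
        lo = max(start, chunk_start)
        hi = min(end, chunk_start + chunk_size - 1)
        reads.append((index, lo - chunk_start, hi - lo + 1))
    return reads
```

Only the chunks that overlap the requested range are touched, which is what keeps video seeking and resumed downloads cheap.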

9️⃣ Multipart Upload

Why Needed?

Large files are hard to upload in one request.

Problems:

Network failure
Timeout
Retry cost
Poor parallelism

Multipart Flow

Initiate multipart upload
→ Upload parts independently
→ Store each part with checksum
→ Complete multipart upload
→ Assemble metadata
→ Make object visible

Benefits

Parallel upload
Resume failed parts
Better throughput
Lower retry cost

Completion Rule

Object should not become visible until upload is completed.

👉 Interview Answer

Multipart upload allows clients to split a large file into parts and upload those parts independently.

This improves throughput, supports retries for individual parts, and avoids restarting the entire upload after a failure.

The object only becomes visible after the complete operation succeeds.
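
The initiate, upload part, and complete steps can be sketched as a small state machine. This is an in-memory illustration under my own assumptions: `MultipartStore` is a hypothetical class, and the client is assumed to pass back the per-part checksums it received when it calls complete.

```python
import hashlib
import uuid


class MultipartStore:
    """Hypothetical in-memory multipart upload tracker."""

    def __init__(self):
        self.uploads = {}  # upload_id -> pending state, not visible to readers
        self.objects = {}  # (bucket, key) -> assembled bytes, visible

    def initiate(self, bucket, key):
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {"bucket": bucket, "key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Re-uploading the same part number overwrites it, so retries are safe.
        self.uploads[upload_id]["parts"][part_number] = data
        return hashlib.sha256(data).hexdigest()

    def complete(self, upload_id, part_checksums):
        state = self.uploads[upload_id]
        parts = state["parts"]
        for number, checksum in part_checksums.items():
            if number not in parts:
                raise ValueError(f"missing part {number}")
            if hashlib.sha256(parts[number]).hexdigest() != checksum:
                raise ValueError(f"checksum mismatch for part {number}")
        # Assemble in part-number order, then make the object visible.
        body = b"".join(parts[n] for n in sorted(part_checksums))
        self.objects[(state["bucket"], state["key"])] = body
        del self.uploads[upload_id]
```

Parts can arrive in any order and in parallel. Nothing appears in `objects` until `complete` has validated every part.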

🔟 Partitioning and Placement

Metadata Partitioning

Partition metadata by:

bucket_id + object_key hash

or:

bucket_id + prefix

Object Data Placement

Storage coordinator chooses nodes based on:

Available capacity
Node health
Rack / availability zone
Region
Replication factor
Load balancing

Avoiding Hot Prefixes

If many objects share the same prefix:

logs/2026/05/02/...

a prefix-based partition may become hot.

Strategies:

Hash object key
Use virtual partitions
Split hot partitions
Add random prefix for high-write workloads

👉 Interview Answer

Metadata partitioning is critical because listing and object lookup both depend on metadata.

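I would partition object metadata by bucket and object key hash to spread load evenly. Because hashing scatters keys that share a prefix across partitions, prefix listing then needs either a separate index sorted by bucket and key or a scatter-gather merge across partitions.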

For object data placement, I would distribute replicas across different nodes, racks, or availability zones to improve durability.
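
As a rough sketch of the hash-plus-virtual-partition idea: the partition count, the SHA-256 choice, and the names `virtual_partition` and `build_assignment` are all my own assumptions for illustration.

```python
import hashlib

NUM_VIRTUAL_PARTITIONS = 1024  # assumed value, chosen only for illustration


def virtual_partition(bucket_id, object_key):
    """Stable hash of (bucket_id, object_key) onto a fixed partition space."""
    digest = hashlib.sha256(f"{bucket_id}\x00{object_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VIRTUAL_PARTITIONS


def build_assignment(servers):
    """Explicit partition -> server table, so one hot partition can be moved."""
    return {p: servers[p % len(servers)] for p in range(NUM_VIRTUAL_PARTITIONS)}
```

Usage: `assignment[virtual_partition(bucket_id, key)]` gives the metadata server for a key. Because the table is explicit, relieving a hot partition is a single entry update rather than a rehash of every key.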

1️⃣1️⃣ Replication and Durability

Replication

Example:

replication factor = 3

Each object chunk is stored on multiple storage nodes.

Placement Rule

Replicas should be placed across:

Different disks
Different nodes
Different racks
Different availability zones

Alternative: Erasure Coding

Instead of full replication:

split data into k data blocks + m parity blocks

Example:

10 data blocks + 4 parity blocks

With 10 + 4, the object can survive the loss of any 4 of its 14 blocks at 1.4x raw storage, compared with 3x for three-way replication.

Replication vs Erasure Coding

Strategy	Pros	Cons
Replication	Simple, fast reads	Higher storage cost
Erasure coding	Lower storage cost	More complex, slower recovery

👉 Interview Answer

For durability, I would store multiple copies of object chunks across different failure domains.

Replication is simpler and provides fast recovery, but it costs more storage.

For large cold objects, erasure coding can reduce storage cost while still providing high durability.
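
The cost side of this trade-off is simple arithmetic, sketched below. The function names are hypothetical, and the erasure coding figure assumes a code where any k of the k + m blocks are enough to rebuild the data.

```python
def replication_profile(replicas):
    """Return (raw bytes stored per logical byte, failures tolerated)."""
    return float(replicas), replicas - 1


def erasure_coding_profile(k, m):
    """Return (raw bytes stored per logical byte, failures tolerated)."""
    return (k + m) / k, m
```

For example, `replication_profile(3)` gives `(3.0, 2)` and `erasure_coding_profile(10, 4)` gives `(1.4, 4)`: less than half the raw storage while tolerating more lost blocks, paid for with more expensive reconstruction.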

1️⃣2️⃣ Consistency Model

Stronger Consistency Needed For

Read-after-write for newly uploaded object
Delete correctness
Permission checks
Bucket metadata
Object version metadata

Eventual Consistency Acceptable For

Cross-region replication
Lifecycle transitions
Storage class changes
Analytics and inventory reports

Versioning

If enabled:

same object_key can have multiple version_id values

Benefits:

Recover deleted objects
Protect against accidental overwrite
Support auditability

👉 Interview Answer

I would aim for read-after-write consistency for newly uploaded objects in the same region.

Object metadata needs careful consistency, because clients expect an uploaded object to be readable after success.

Cross-region replication and lifecycle transitions can be eventually consistent.

Versioning can help protect against accidental deletes or overwrites.
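
A minimal sketch of what versioning means for the metadata layer: every put appends a new version instead of overwriting. `VersionedBucket` and the `v1`, `v2` style IDs are hypothetical. A real system would use opaque version IDs and would also need to model deletes.

```python
import itertools


class VersionedBucket:
    """Hypothetical in-memory bucket that keeps every version of a key."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, data), oldest first
        self._counter = itertools.count(1)

    def put(self, key, data):
        version_id = f"v{next(self._counter)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]  # latest version wins by default
        for vid, data in versions:
            if vid == version_id:
                return data
        raise KeyError(version_id)
```

An accidental overwrite then only changes what the default `get` returns. The earlier bytes remain reachable by version ID.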

1️⃣3️⃣ Listing Objects

API

GET /buckets/{bucketName}/objects?prefix=photos/2026/&limit=1000&cursor=xxx

Challenges

Buckets may contain billions of objects
Prefix listing can become expensive
Results need pagination
Newly written objects may appear with slight delay depending on consistency model

Strategies

Store object metadata sorted by bucket + key
Use cursor-based pagination
Maintain prefix indexes
Avoid scanning full bucket

👉 Interview Answer

Listing objects is a metadata query, not a storage-node data query.

Since buckets can contain billions of objects, listing must be paginated and should use metadata indexes sorted by bucket and object key.

Prefix listing should avoid scanning the entire bucket.
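
Here is a small sketch of prefix listing with cursor-based pagination over keys kept in sorted order. The sorted Python list stands in for a sorted metadata index, `list_objects` is a hypothetical name, and the cursor is assumed to be the last key of the previous page.

```python
import bisect


def list_objects(sorted_keys, prefix, limit, cursor=None):
    """Return (page, next_cursor) for keys that start with prefix."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Seek directly to the start position instead of scanning the bucket.
    if cursor is None:
        start = bisect.bisect_left(sorted_keys, prefix)
    else:
        start = bisect.bisect_right(sorted_keys, cursor)
    page = []
    for i in range(start, len(sorted_keys)):
        key = sorted_keys[i]
        if not key.startswith(prefix):
            break  # sorted order: we are past the prefix range
        if len(page) == limit:
            return page, page[-1]  # more matches remain
        page.append(key)
    return page, None
```

Each call does one binary search plus at most `limit + 1` key reads, regardless of how many objects the bucket holds.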

1️⃣4️⃣ Security and Access Control

Access Control Options

Bucket policy
Object ACL
IAM-style permissions
Pre-signed URLs
Temporary credentials

Pre-signed URL

Allows temporary access:

GET object allowed until expiration time

Useful for:

Direct browser upload
Temporary file sharing
Reducing application server load

Encryption

Support:

TLS in transit
Server-side encryption
Client-side encryption
Key management service integration

👉 Interview Answer

Security is critical for object storage.

I would support bucket policies, IAM-style permissions, object-level access control, and pre-signed URLs for temporary access.

Data should be encrypted in transit and at rest, with integration to a key management service.
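
The idea behind a pre-signed URL can be sketched with an HMAC over the method, path, and expiry. This is a deliberately simplified scheme of my own for illustration, not the signing protocol of any real service. `presign`, `verify`, and the query parameter names are hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode


def _sign(secret, method, path, expires_at):
    message = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def presign(secret, method, path, expires_at):
    """Build a URL that grants `method` on `path` until `expires_at` (epoch s)."""
    query = urlencode({
        "expires": expires_at,
        "signature": _sign(secret, method, path, expires_at),
    })
    return f"{path}?{query}"


def verify(secret, method, path, expires_at, signature, now=None):
    now = time.time() if now is None else now
    if now > expires_at:
        return False  # link has expired
    expected = _sign(secret, method, path, expires_at)
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(expected, signature)
```

Because the method is part of the signed message, a URL issued for GET cannot be replayed as a PUT. The server only needs the shared secret to validate a request, not a session lookup.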

1️⃣5️⃣ Lifecycle and Storage Classes

Storage Classes

Examples:

standard
infrequent_access
archive
deep_archive

Lifecycle Rules

Examples:

Move objects older than 30 days to infrequent access
Move objects older than 180 days to archive
Delete temporary files after 7 days

Background Workers

Lifecycle workers:

Scan metadata
Find eligible objects
Move objects between storage classes
Delete expired objects
Clean incomplete multipart uploads

👉 Interview Answer

To control cost, I would support lifecycle policies and storage classes.

Frequently accessed objects stay in standard storage, while older or rarely accessed objects can move to cheaper archive storage.

Background workers enforce lifecycle rules asynchronously.
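
A lifecycle worker's per-object decision might look like the sketch below, using the example rules above. The `tmp/` prefix for temporary files, the `move:` and `delete` action strings, and the function name are my own assumptions.

```python
from datetime import datetime, timedelta, timezone

CLASS_ORDER = ["standard", "infrequent_access", "archive", "deep_archive"]
# (minimum age, target class), checked from the oldest threshold down.
TRANSITIONS = [
    (timedelta(days=180), "archive"),
    (timedelta(days=30), "infrequent_access"),
]
TEMP_PREFIX = "tmp/"  # assumed convention for temporary files
TEMP_TTL = timedelta(days=7)


def lifecycle_action(key, created_at, storage_class, now):
    """Return 'delete', 'move:<class>', or None for a single object."""
    age = now - created_at
    if key.startswith(TEMP_PREFIX):
        return "delete" if age >= TEMP_TTL else None
    for min_age, target in TRANSITIONS:
        if age >= min_age:
            # Only ever move toward a colder class, never back.
            colder = CLASS_ORDER.index(target) > CLASS_ORDER.index(storage_class)
            return f"move:{target}" if colder else None
    return None
```

Since the function is pure and idempotent, a worker can rescan the same metadata range after a crash without double-applying a transition.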

1️⃣6️⃣ CDN and Edge Caching

Why CDN?

Object storage often serves static content.

Examples:

Images
Videos
Downloads
Public assets

Flow

Client
→ CDN Edge
→ Object Storage Origin

Benefits

Lower latency
Less origin load
Better global performance
Reduced bandwidth cost

👉 Interview Answer

For public or frequently accessed objects, I would integrate object storage with a CDN.

CDN caches objects close to users, reducing download latency and origin load.

Cache invalidation and signed URLs may be needed for private or frequently updated content.

1️⃣7️⃣ Failure Handling

Common Failures

Storage node failure
Disk failure
Metadata service unavailable
Partial upload failure
Replication lag
Checksum mismatch
Hot partition
Region outage

Strategies

Replication across failure domains
Erasure coding
Checksum validation
Background repair
Retry failed parts
Quorum writes for metadata
Failover to replicas
Cross-region replication for disaster recovery

👉 Interview Answer

Object storage must assume disks and nodes will fail.

I would use replication or erasure coding, checksum validation, background repair, and placement across failure domains.

For large uploads, multipart upload allows failed parts to be retried independently.

For disaster recovery, cross-region replication can be used.
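
Background repair can be sketched as a planner that finds under-replicated chunks and picks new target nodes in racks that do not already hold a copy. The data shapes and the name `plan_repairs` are hypothetical. A real planner would also weigh capacity and load.

```python
def plan_repairs(chunk_replicas, healthy_nodes, replication_factor=3):
    """Plan new replicas for chunks that lost copies.

    chunk_replicas: chunk_id -> set of node names holding a replica
    healthy_nodes:  node name -> rack, for nodes currently alive
    Returns chunk_id -> list of nodes that should receive a new copy.
    """
    plan = {}
    for chunk_id, nodes in chunk_replicas.items():
        live = [n for n in nodes if n in healthy_nodes]
        missing = replication_factor - len(live)
        if missing <= 0:
            continue
        used_racks = {healthy_nodes[n] for n in live}
        targets = []
        for node, rack in sorted(healthy_nodes.items()):
            if len(targets) == missing:
                break
            # Keep replicas in distinct racks (failure domains).
            if node not in live and rack not in used_racks:
                targets.append(node)
                used_racks.add(rack)
        plan[chunk_id] = targets
    return plan
```

A chunk whose plan has fewer targets than missing replicas signals that there are not enough distinct failure domains left, which is worth alerting on.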

1️⃣8️⃣ Observability

Key Metrics

Upload latency
Download latency
Error rate
Metadata query latency
Storage node health
Replication lag
Disk utilization
Object count
Bucket size
Failed checksum count
Lifecycle job lag
Hot partition count

Important Dashboards

Storage capacity
API latency
Metadata service health
Replication health
Node failures
Lifecycle progress
Cost by bucket / tenant

👉 Interview Answer

Observability is essential for object storage.

I would monitor upload and download latency, storage node health, replication lag, disk utilization, metadata query latency, checksum failures, and lifecycle job progress.

These metrics help detect durability risks, performance issues, and cost growth.

1️⃣9️⃣ End-to-End Flow

Upload Flow

Client uploads object
→ Authenticate request
→ Validate bucket and permissions
→ Storage coordinator chooses nodes
→ Write object chunks
→ Verify checksum
→ Replicate chunks
→ Commit object metadata
→ Return success

Download Flow

Client requests object
→ Authenticate request
→ Fetch object metadata
→ Locate chunks
→ Read from storage nodes
→ Stream object back to client

Multipart Upload Flow

Initiate upload
→ Upload parts in parallel
→ Store each part
→ Complete upload
→ Commit final object metadata
→ Make object visible

Lifecycle Flow

Lifecycle worker scans metadata
→ Finds eligible objects
→ Moves object to cheaper storage class
→ Updates metadata
→ Deletes expired objects

Key Insight

S3-like storage is not a file system — it is a highly durable, distributed object store.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing an S3-like file storage system, I think of it as a distributed object storage system.

The core abstraction is bucket, object, and key. A bucket is a namespace, and an object contains data, metadata, checksum, version, and storage class.

I would separate metadata from object data. Metadata is stored in a strongly consistent metadata service, while object bytes are stored across distributed storage nodes.

During upload, the system authenticates the request, validates bucket permissions, writes object chunks to storage nodes, verifies checksums, replicates the data, and only then commits metadata.

For downloads, the system reads metadata, locates object chunks, and streams data from storage nodes.

For large files, I would support multipart upload, allowing clients to upload parts independently and retry failed parts without restarting the whole upload.

For durability, object chunks should be replicated across failure domains, or stored using erasure coding for cost-efficient durability.

Metadata should support efficient object lookup and prefix listing, usually with pagination and cursor-based listing.

For security, I would support IAM-style permissions, bucket policies, object ACLs, pre-signed URLs, encryption at rest and in transit, and integration with a key management service.

Lifecycle policies and storage classes help control cost by moving old or rarely accessed objects to cheaper storage.

The main trade-offs are among durability, availability, consistency, storage cost, latency, and metadata scalability.

Ultimately, the goal is to store massive amounts of object data durably and cost-effectively, while providing scalable access, strong security, and predictable performance.

⭐ Final Insight

The core of S3-like file storage is not a traditional file system, but a highly durable distributed object storage system built on the bucket and object abstractions.
