System Design Deep Dive - 11 Design File Storage (S3-like)

Post by ailswan May. 04, 2026


🎯 Design File Storage (S3-like)

1️⃣ Core Framework

When discussing S3-like File Storage design, I frame it as:

  1. Core object storage model: bucket, object, key
  2. Upload, download, delete, and list flows
  3. Metadata and object data separation
  4. Partitioning and replication
  5. Durability and availability
  6. Large file handling and multipart upload
  7. Consistency, versioning, and lifecycle policies
  8. Security, access control, and cost control

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

An S3-like storage system is an object storage system.

It stores data as objects inside buckets, where each object is identified by a key.

The main challenges are durability, availability, scalable metadata management, large file upload, replication, access control, and cost-efficient storage.


3️⃣ Core Concepts


Bucket

A bucket is a namespace for objects.

Example:

bucket = user-photos

Object

An object contains data, metadata, a checksum, a version, and a storage class.


Object Key

Example:

photos/2026/05/image.jpg

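Important: the key only looks like a file path. Slashes are part of the key string, and no real directories exist behind it.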


👉 Interview Answer

I would model the system around buckets and objects.

A bucket provides a namespace, and each object is identified by a key.

The key may look like a file path, but internally the system is usually a flat key-value object store, not a hierarchical file system.
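To make the flat-namespace point concrete, here is a minimal in-memory sketch. The class name `FlatObjectStore` and the method `list_level` are hypothetical names chosen for illustration; the sketch assumes a single process and ignores persistence. It shows that "folders" can be derived at list time by grouping keys on a delimiter, without storing any directory structure.

```python
class FlatObjectStore:
    """Toy flat namespace: one dict from (bucket, key) to bytes, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, bucket, key, data):
        self._objects[(bucket, key)] = data

    def get(self, bucket, key):
        return self._objects[(bucket, key)]

    def list_level(self, bucket, prefix="", delimiter="/"):
        """Emulate one folder level by grouping keys on the next delimiter."""
        keys, folders = [], set()
        for b, key in sorted(self._objects):
            if b != bucket or not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter in rest:
                # Everything up to the next delimiter acts as a virtual folder.
                folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                keys.append(key)
        return keys, sorted(folders)


store = FlatObjectStore()
store.put("user-photos", "photos/2026/05/image.jpg", b"jpeg-bytes")
store.put("user-photos", "photos/readme.txt", b"hello")
keys, folders = store.list_level("user-photos", prefix="photos/")
```

Here `keys` holds the objects directly under `photos/` and `folders` holds the virtual sub-prefixes, even though the store only ever saw flat strings.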


4️⃣ Main APIs


Create Bucket

PUT /buckets/{bucketName}

Upload Object

PUT /buckets/{bucketName}/objects/{objectKey}

Request body:

binary file content

Headers:

Content-Type: image/jpeg
Content-Length: 1048576
x-checksum-sha256: abc123

Download Object

GET /buckets/{bucketName}/objects/{objectKey}

Delete Object

DELETE /buckets/{bucketName}/objects/{objectKey}

List Objects

GET /buckets/{bucketName}/objects?prefix=photos/2026/&limit=1000&cursor=xxx

Multipart Upload

POST /buckets/{bucketName}/objects/{objectKey}/multipart
PUT /multipart/{uploadId}/part/{partNumber}
POST /multipart/{uploadId}/complete

👉 Interview Answer

The core APIs are create bucket, upload object, download object, delete object, and list objects by prefix.

For large files, I would support multipart upload, where the client uploads file parts independently and then completes the upload once all parts are stored.


5️⃣ Data Model


Bucket Metadata Table

bucket (
  bucket_id VARCHAR PRIMARY KEY,
  bucket_name VARCHAR UNIQUE,
  owner_id VARCHAR,
  region VARCHAR,
  created_at TIMESTAMP,
  versioning_enabled BOOLEAN,
  lifecycle_policy JSON
)

Object Metadata Table

object_metadata (
  object_id VARCHAR UNIQUE,  -- referenced by object_chunk.object_id
  bucket_id VARCHAR,
  object_key VARCHAR,
  version_id VARCHAR,
  size BIGINT,
  content_type VARCHAR,
  checksum VARCHAR,
  storage_class VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  physical_location JSON,
  user_metadata JSON,
  PRIMARY KEY (bucket_id, object_key, version_id)
)

Object Chunk Table

object_chunk (
  object_id VARCHAR,
  chunk_index INT,
  chunk_id VARCHAR,
  size BIGINT,
  checksum VARCHAR,
  storage_nodes ARRAY,
  PRIMARY KEY (object_id, chunk_index)
)

Multipart Upload Table

multipart_upload (
  upload_id VARCHAR PRIMARY KEY,
  bucket_id VARCHAR,
  object_key VARCHAR,
  owner_id VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP
)

👉 Interview Answer

I would separate object metadata from object data.

Metadata stores object key, size, checksum, version, permissions, and physical location.

The actual object bytes are stored in distributed storage nodes, often split into chunks for large objects.


6️⃣ High-Level Architecture


Client
→ API Gateway
→ Auth Service
→ Metadata Service
→ Storage Coordinator
→ Storage Nodes
→ Replication Service
→ Background Repair / Lifecycle Workers

Main Components

API Gateway


Metadata Service


Storage Coordinator


Storage Nodes


Background Workers


👉 Interview Answer

I would separate metadata services from storage nodes.

Metadata service handles bucket and object metadata, while storage nodes store the actual bytes.

A storage coordinator manages chunk placement, replication, checksum verification, and object assembly during reads.


7️⃣ Upload Flow


Basic Upload Flow

Client uploads object
→ API Gateway authenticates request
→ Metadata Service validates bucket and permissions
→ Storage Coordinator selects storage nodes
→ Object data written to storage nodes
→ Checksums verified
→ Object metadata committed
→ Response returned to client

Important Design Choice

Commit metadata only after data is safely stored.

Why? If metadata were committed first, a successful metadata write could point to data that is missing.


👉 Interview Answer

During upload, I would first authenticate the request and validate bucket permissions.

Then the storage coordinator writes object data to storage nodes and verifies checksums.

Only after the data is durably stored would the metadata service commit the object metadata.

This prevents successful metadata writes from pointing to missing data.
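The ordering described above can be sketched in a few lines. This is a simplified illustration, not a production write path: `StorageNode`, `put_object`, and the `blob_id` format are hypothetical, the metadata store is a plain dict, and the writes are synchronous to every replica. The point is only that the metadata commit is the last step.

```python
import hashlib


class StorageNode:
    """Hypothetical storage node that keeps blobs in memory."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def write(self, blob_id, data):
        self.blobs[blob_id] = data

    def read(self, blob_id):
        return self.blobs[blob_id]


def put_object(metadata, nodes, bucket, key, data, expected_sha256):
    # 1. Reject the upload early if the client checksum does not match.
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError("checksum mismatch; nothing written")

    # 2. Write to every replica and verify what was actually stored.
    blob_id = f"{bucket}/{key}#{digest[:16]}"
    for node in nodes:
        node.write(blob_id, data)
        if hashlib.sha256(node.read(blob_id)).hexdigest() != digest:
            raise IOError(f"verification failed on {node.name}")

    # 3. Commit metadata last, so it never points at missing data.
    metadata[(bucket, key)] = {
        "size": len(data),
        "checksum": digest,
        "blob_id": blob_id,
        "nodes": [node.name for node in nodes],
    }
    return metadata[(bucket, key)]
```

If any step before the commit fails, the object never becomes visible; orphaned blobs left on nodes would be cleaned up by a background worker.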


8️⃣ Download Flow


Basic Download Flow

Client requests object
→ API Gateway authenticates request
→ Metadata Service fetches object metadata
→ Storage Coordinator locates chunks
→ Storage nodes return data
→ Data streamed back to client

Range Read

Support:

Range: bytes=0-1048575

Used for large files, media streaming, and resumable downloads.


👉 Interview Answer

For downloads, the system first checks permissions and fetches object metadata.

Then it locates the physical chunks and streams data from storage nodes back to the client.

I would also support range reads, which are important for large files, media streaming, and resumable downloads.
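A range read has to be translated into reads against individual chunks. The helper below is a small sketch of that mapping, assuming fixed-size chunks and an inclusive byte range as in the HTTP `Range` header; the function name `chunks_for_range` is made up for this example.

```python
def chunks_for_range(start, end, chunk_size):
    """Map an inclusive byte range to (chunk_index, offset_in_chunk, length) reads."""
    if start < 0 or end < start or chunk_size <= 0:
        raise ValueError("invalid range or chunk size")
    spans = []
    pos = start
    while pos <= end:
        index = pos // chunk_size
        offset = pos % chunk_size
        # Read to the end of this chunk or the end of the range, whichever is first.
        length = min(chunk_size - offset, end - pos + 1)
        spans.append((index, offset, length))
        pos += length
    return spans


# Range: bytes=0-1048575 against 4 MiB chunks touches only the first chunk.
first_mib = chunks_for_range(0, 1048575, 4 * 1024 * 1024)
```

Only the chunks that overlap the requested range are fetched, which is what makes seeking in a large video or resuming a download cheap.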


9️⃣ Multipart Upload


Why Needed?

Large files are hard to upload in one request.

Problems:


Multipart Flow

Initiate multipart upload
→ Upload parts independently
→ Store each part with checksum
→ Complete multipart upload
→ Assemble metadata
→ Make object visible

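Benefits

Parts can be uploaded in parallel for higher throughput, a failed part can be retried on its own, and a failure does not force the whole upload to restart.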


Completion Rule

Object should not become visible until upload is completed.


👉 Interview Answer

Multipart upload allows clients to split a large file into parts and upload those parts independently.

This improves throughput, supports retries for individual parts, and avoids restarting the entire upload after a failure.

The object only becomes visible after the complete operation succeeds.
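The initiate, upload-part, and complete steps can be sketched with an in-memory store. `MultipartStore` and its method names are hypothetical, and real systems would persist parts on storage nodes and track state in the multipart upload table; the sketch only demonstrates per-part checksums, out-of-order uploads, and the rule that the object is invisible until completion.

```python
import hashlib
import uuid


class MultipartStore:
    def __init__(self):
        self.visible = {}  # (bucket, key) -> assembled bytes
        self.uploads = {}  # upload_id -> {"target": (bucket, key), "parts": {n: bytes}}

    def initiate(self, bucket, key):
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {"target": (bucket, key), "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Re-uploading the same part number simply replaces it (a retry).
        self.uploads[upload_id]["parts"][part_number] = data
        return hashlib.sha256(data).hexdigest()

    def complete(self, upload_id, part_checksums):
        """part_checksums maps part_number -> checksum returned by upload_part."""
        upload = self.uploads[upload_id]
        parts = upload["parts"]
        for number, checksum in part_checksums.items():
            if number not in parts:
                raise ValueError(f"part {number} was never uploaded")
            if hashlib.sha256(parts[number]).hexdigest() != checksum:
                raise ValueError(f"part {number} failed checksum verification")
        data = b"".join(parts[n] for n in sorted(part_checksums))
        # Only now does the object become visible to readers.
        self.visible[upload["target"]] = data
        del self.uploads[upload_id]
        return len(data)
```

A client would call `initiate`, upload parts in any order (possibly in parallel), and pass the collected checksums to `complete`.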


🔟 Partitioning and Placement


Metadata Partitioning

Partition metadata by:

bucket_id + object_key hash

or:

bucket_id + prefix

Object Data Placement

Storage coordinator chooses nodes based on:


Avoiding Hot Prefixes

If many objects share the same prefix:

logs/2026/05/02/...

a prefix-based partition may become hot.

Strategies:


👉 Interview Answer

Metadata partitioning is critical because listing and object lookup both depend on metadata.

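I would partition object metadata by bucket and object key hash, which spreads load evenly. Because hashing scatters keys that share a prefix across partitions, efficient prefix listing also needs a key-ordered index or a scatter-gather query across partitions.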

For object data placement, I would distribute replicas across different nodes, racks, or availability zones to improve durability.
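As a rough sketch of both ideas, the functions below hash a key to a metadata partition and pick one node per zone for each replica. The names `metadata_partition` and `place_replicas` are hypothetical, and the placement is purely hash-based; a real coordinator would also weigh free capacity, load, and node health.

```python
import hashlib


def _stable_hash(text):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def metadata_partition(bucket_id, object_key, num_partitions):
    return _stable_hash(f"{bucket_id}/{object_key}") % num_partitions


def place_replicas(nodes_by_zone, chunk_id, replication_factor=3):
    """Pick one node from each of `replication_factor` distinct zones."""
    zones = sorted(nodes_by_zone)
    if replication_factor > len(zones):
        raise ValueError("not enough zones for the requested replication factor")
    seed = _stable_hash(chunk_id)
    chosen = []
    for i in range(replication_factor):
        zone = zones[(seed + i) % len(zones)]
        nodes = nodes_by_zone[zone]
        chosen.append((zone, nodes[seed % len(nodes)]))
    return chosen


cluster = {
    "zone-a": ["a1", "a2"],
    "zone-b": ["b1", "b2"],
    "zone-c": ["c1", "c2"],
}
replicas = place_replicas(cluster, "chunk-0001")
```

Because each replica lands in a different zone, losing a whole zone still leaves two copies of the chunk.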


1️⃣1️⃣ Replication and Durability


Replication

Example:

replication factor = 3

Each object chunk is stored on multiple storage nodes.


Placement Rule

Replicas should be placed across different nodes, racks, or availability zones.


Alternative: Erasure Coding

Instead of full replication:

split data into k data blocks + m parity blocks

Example:

10 data blocks + 4 parity blocks

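With a maximum-distance-separable code such as Reed-Solomon, this layout tolerates the loss of any 4 of the 14 blocks at 1.4x storage overhead, compared with 3x for three-way replication.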


Replication vs Erasure Coding

Strategy       | Pros               | Cons
Replication    | Simple, fast reads | Higher storage cost
Erasure coding | Lower storage cost | More complex, slower recovery

👉 Interview Answer

For durability, I would store multiple copies of object chunks across different failure domains.

Replication is simpler and provides fast recovery, but it costs more storage.

For large cold objects, erasure coding can reduce storage cost while still providing high durability.
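The cost trade-off is simple arithmetic, sketched below with hypothetical helper names. The erasure-coding numbers assume a maximum-distance-separable code such as Reed-Solomon. The XOR function is only the simplest single-parity case (m = 1), included to show how a lost block is rebuilt from the survivors; it is not how a 10 + 4 code works internally.

```python
def replication_profile(copies):
    return {"overhead": float(copies), "tolerated_losses": copies - 1}


def erasure_profile(k, m):
    # k data blocks + m parity blocks: any m blocks may be lost (MDS assumption).
    return {"overhead": (k + m) / k, "tolerated_losses": m}


def xor_parity(blocks):
    """XOR equal-length blocks together; also used to rebuild one missing block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


data_blocks = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data_blocks)
# Pretend block 1 was lost: XOR of the survivors and the parity restores it.
rebuilt = xor_parity([data_blocks[0], data_blocks[2], parity])
```

Rebuilding a block requires reading several surviving blocks, which is why recovery under erasure coding is slower and more network-heavy than copying a replica.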


1️⃣2️⃣ Consistency Model


Stronger Consistency Needed For


Eventual Consistency Acceptable For


Versioning

If enabled:

same object_key can have multiple version_id values

Benefits:


👉 Interview Answer

I would aim for read-after-write consistency for newly uploaded objects in the same region.

Object metadata needs careful consistency, because clients expect an uploaded object to be readable after success.

Cross-region replication and lifecycle transitions can be eventually consistent.

Versioning can help protect against accidental deletes or overwrites.
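A small sketch of how versioning protects against deletes and overwrites, assuming deletes are recorded as delete markers rather than removing data. `VersionedBucket` and the `v1`, `v2` identifiers are illustrative; real version IDs would be opaque and globally unique.

```python
import itertools


class VersionedBucket:
    def __init__(self):
        self._versions = {}  # key -> list of (version_id, data or None)
        self._ids = itertools.count(1)

    def _append(self, key, data):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def put(self, key, data):
        # An overwrite adds a new version instead of replacing the old one.
        return self._append(key, data)

    def delete(self, key):
        # A delete marker (data=None) hides the object but keeps older versions.
        return self._append(key, None)

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            if not versions or versions[-1][1] is None:
                raise KeyError(key)
            return versions[-1][1]
        for vid, data in versions:
            if vid == version_id and data is not None:
                return data
        raise KeyError((key, version_id))
```

After an accidental delete, a plain `get` fails, but any earlier version can still be fetched by its `version_id`.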


1️⃣3️⃣ Listing Objects


API

GET /buckets/{bucketName}/objects?prefix=photos/2026/&limit=1000&cursor=xxx

Challenges


Strategies


👉 Interview Answer

Listing objects is a metadata query, not a storage-node data query.

Since buckets can contain billions of objects, listing must be paginated and should use metadata indexes sorted by bucket and object key.

Prefix listing should avoid scanning the entire bucket.
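Cursor-based prefix listing over a key-ordered index can be sketched with a sorted list and binary search. This assumes the index for one bucket fits in a single sorted sequence and that the cursor is simply the last key returned; `list_objects` is a hypothetical name, and a real service would usually return an opaque cursor token.

```python
import bisect


def list_objects(sorted_keys, prefix="", limit=1000, cursor=None):
    """Return (page, next_cursor) for keys starting with prefix, in key order."""
    if cursor is not None:
        # Resume strictly after the last key of the previous page.
        start = bisect.bisect_right(sorted_keys, cursor)
    else:
        # Jump straight to the first key >= prefix instead of scanning the bucket.
        start = bisect.bisect_left(sorted_keys, prefix)
    page = []
    for i in range(start, len(sorted_keys)):
        key = sorted_keys[i]
        if not key.startswith(prefix):
            break  # keys are sorted, so the prefix range has ended
        page.append(key)
        if len(page) == limit:
            break
    # A full page may be the last one; the next call then returns an empty page.
    next_cursor = page[-1] if len(page) == limit else None
    return page, next_cursor


index = sorted([
    "logs/1", "photos/2026/a", "photos/2026/b",
    "photos/2026/c", "photos/2027/a", "zzz",
])
page1, cursor1 = list_objects(index, prefix="photos/2026/", limit=2)
page2, cursor2 = list_objects(index, prefix="photos/2026/", limit=2, cursor=cursor1)
```

Each call costs a binary search plus at most `limit` reads, independent of the total number of objects in the bucket.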


1️⃣4️⃣ Security and Access Control


Access Control Options


Pre-signed URL

Allows temporary access:

GET object allowed until expiration time

Useful for:


Encryption

Support encryption in transit and at rest, with keys managed through a key management service.


👉 Interview Answer

Security is critical for object storage.

I would support bucket policies, IAM-style permissions, object-level access control, and pre-signed URLs for temporary access.

Data should be encrypted in transit and at rest, with integration to a key management service.
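The idea behind a pre-signed URL can be illustrated with an HMAC over the method, path, and expiry time. This is a deliberately simplified scheme invented for this sketch, not the signing protocol of any real service; `SECRET`, `presign`, `verify`, and the query parameter names are all assumptions.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical server-side signing key


def _sign(method, path, expires_at, secret):
    payload = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def presign(method, path, expires_at, secret=SECRET):
    query = urlencode({
        "expires": expires_at,
        "signature": _sign(method, path, expires_at, secret),
    })
    return f"{path}?{query}"


def verify(method, url, now=None, secret=SECRET):
    now = time.time() if now is None else now
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _sign(method, parsed.path, expires_at, secret)
    # Constant-time comparison, then the expiry check.
    return hmac.compare_digest(expected, signature) and now <= expires_at


url = presign("GET", "/buckets/user-photos/objects/photos/2026/05/image.jpg", 1000)
```

The holder of the URL can perform exactly the signed operation until the expiry time, without ever seeing the owner's credentials.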


1️⃣5️⃣ Lifecycle and Storage Classes


Storage Classes

Examples:

standard
infrequent_access
archive
deep_archive

Lifecycle Rules

Examples:

Move objects older than 30 days to infrequent access
Move objects older than 180 days to archive
Delete temporary files after 7 days

Background Workers

Lifecycle workers:


👉 Interview Answer

To control cost, I would support lifecycle policies and storage classes.

Frequently accessed objects stay in standard storage, while older or rarely accessed objects can move to cheaper archive storage.

Background workers enforce lifecycle rules asynchronously.
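The example rules above can be expressed as data and evaluated per object by a lifecycle worker. The rule format, the `tmp/` prefix for temporary files, and the function name `lifecycle_action` are assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical encoding of the example rules; "tmp/" marks temporary files.
RULES = [
    {"prefix": "", "min_age_days": 30, "action": "infrequent_access"},
    {"prefix": "", "min_age_days": 180, "action": "archive"},
    {"prefix": "tmp/", "min_age_days": 7, "action": "delete"},
]


def lifecycle_action(key, created_at, now, rules=RULES):
    """Return the action a worker should apply to this object, or None."""
    age = now - created_at
    # Deletes win over transitions; among transitions the oldest threshold wins.
    ordered = sorted(
        rules, key=lambda r: (r["action"] != "delete", -r["min_age_days"])
    )
    for rule in ordered:
        old_enough = age >= timedelta(days=rule["min_age_days"])
        if key.startswith(rule["prefix"]) and old_enough:
            return rule["action"]
    return None


now = datetime(2026, 5, 4, tzinfo=timezone.utc)
action = lifecycle_action("photos/a.jpg", now - timedelta(days=40), now)
```

A worker would scan metadata in batches, call a function like this for each object, and skip objects that are already in the target storage class.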


1️⃣6️⃣ CDN and Edge Caching


Why CDN?

Object storage often serves static content.

Examples:


Flow

Client
→ CDN Edge
→ Object Storage Origin

Benefits


👉 Interview Answer

For public or frequently accessed objects, I would integrate object storage with a CDN.

CDN caches objects close to users, reducing download latency and origin load.

Cache invalidation and signed URLs may be needed for private or frequently updated content.


1️⃣7️⃣ Failure Handling


Common Failures


Strategies


👉 Interview Answer

Object storage must assume disks and nodes will fail.

I would use replication or erasure coding, checksum validation, background repair, and placement across failure domains.

For large uploads, multipart upload allows failed parts to be retried independently.

For disaster recovery, cross-region replication can be used.
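Checksum validation and background repair can be combined in a scrubber. The sketch below assumes full replication and models the replicas of one chunk as a dict from node name to bytes (or `None` when the copy is missing); `scrub_chunk` is a hypothetical name.

```python
import hashlib


def scrub_chunk(replicas, expected_sha256):
    """Verify every replica of a chunk and repair bad ones from a healthy copy.

    replicas: dict of node name -> bytes, or None if the copy is missing.
    Returns the list of node names that were repaired (dict is updated in place).
    """
    def is_healthy(data):
        return data is not None and hashlib.sha256(data).hexdigest() == expected_sha256

    healthy = [name for name, data in replicas.items() if is_healthy(data)]
    if not healthy:
        raise RuntimeError("no healthy replica left; chunk is unrecoverable")

    good_copy = replicas[healthy[0]]
    repaired = []
    for name in list(replicas):
        if not is_healthy(replicas[name]):
            replicas[name] = good_copy  # re-replicate from the healthy copy
            repaired.append(name)
    return repaired


chunk = b"object-bytes"
checksum = hashlib.sha256(chunk).hexdigest()
state = {"n1": chunk, "n2": b"object-bytez", "n3": None}
fixed = scrub_chunk(state, checksum)
```

Running such a scrubber continuously catches silent corruption and lost copies before enough replicas fail to cause data loss.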


1️⃣8️⃣ Observability


Key Metrics


Important Dashboards


👉 Interview Answer

Observability is essential for object storage.

I would monitor upload and download latency, storage node health, replication lag, disk utilization, metadata query latency, checksum failures, and lifecycle job progress.

These metrics help detect durability risks, performance issues, and cost growth.


1️⃣9️⃣ End-to-End Flow


Upload Flow

Client uploads object
→ Authenticate request
→ Validate bucket and permissions
→ Storage coordinator chooses nodes
→ Write object chunks
→ Verify checksum
→ Replicate chunks
→ Commit object metadata
→ Return success

Download Flow

Client requests object
→ Authenticate request
→ Fetch object metadata
→ Locate chunks
→ Read from storage nodes
→ Stream object back to client

Multipart Upload Flow

Initiate upload
→ Upload parts in parallel
→ Store each part
→ Complete upload
→ Commit final object metadata
→ Make object visible

Lifecycle Flow

Lifecycle worker scans metadata
→ Finds eligible objects
→ Moves object to cheaper storage class
→ Updates metadata
→ Deletes expired objects

Key Insight

S3-like storage is not a file system — it is a highly durable, distributed object store.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing an S3-like file storage system, I think of it as a distributed object storage system.

The core abstraction is bucket, object, and key. A bucket is a namespace, and an object contains data, metadata, checksum, version, and storage class.

I would separate metadata from object data. Metadata is stored in a strongly consistent metadata service, while object bytes are stored across distributed storage nodes.

During upload, the system authenticates the request, validates bucket permissions, writes object chunks to storage nodes, verifies checksums, replicates the data, and only then commits metadata.

For downloads, the system reads metadata, locates object chunks, and streams data from storage nodes.

For large files, I would support multipart upload, allowing clients to upload parts independently and retry failed parts without restarting the whole upload.

For durability, object chunks should be replicated across failure domains, or stored using erasure coding for cost-efficient durability.

Metadata should support efficient object lookup and prefix listing, usually with pagination and cursor-based listing.

For security, I would support IAM-style permissions, bucket policies, object ACLs, pre-signed URLs, encryption at rest and in transit, and integration with a key management service.

Lifecycle policies and storage classes help control cost by moving old or rarely accessed objects to cheaper storage.

The main trade-offs are durability, availability, consistency, storage cost, latency, and metadata scalability.

Ultimately, the goal is to store massive amounts of object data durably and cost-effectively, while providing scalable access, strong security, and predictable performance.


⭐ Final Insight

The core of S3-like file storage is not a traditional file system, but a highly durable distributed object storage system built on the bucket and object abstractions.



