·

System Design Deep Dive - 06 How Retrieval Systems Scale to Billion Documents

Post by ailswan May. 24, 2026

中文 ↓

🎯 How Retrieval Systems Scale to Billion Documents


1️⃣ Core Framework

When discussing retrieval systems at billion-document scale, I frame it as:

  1. Why retrieval becomes hard at scale
  2. Indexing architecture
  3. Sharding and partitioning
  4. Distributed query execution
  5. Vector search at scale
  6. Hybrid search and ranking
  7. Freshness and re-indexing
  8. Trade-offs: recall vs latency vs cost

2️⃣ Why Retrieval Becomes Hard at Scale

Searching a few thousand documents is easy.

Searching billions of documents is a distributed systems problem.


Challenges

At billion-document scale, the system must handle:


Core Problem

More documents
→ Larger index
→ More shards
→ More network calls
→ Higher latency
→ Harder ranking

👉 Interview Answer

Retrieval at billion-document scale is not just a search problem.

It is a distributed systems problem involving indexing, sharding, replication, ranking, caching, freshness, access control, and cost control.


3️⃣ High-Level Architecture


Architecture

Documents
→ Ingestion Pipeline
→ Parser / Cleaner
→ Chunker
→ Embedding / Index Builder
→ Distributed Index
→ Query Router
→ Shard Search
→ Result Aggregator
→ Re-ranker
→ Final Results

Two Main Paths

Offline Indexing Path

Documents
→ Parse
→ Chunk
→ Embed
→ Build Index
→ Replicate Index

Online Query Path

Query
→ Query Router
→ Search Relevant Shards
→ Merge Candidates
→ Re-rank
→ Return Results

👉 Interview Answer

Large-scale retrieval systems usually separate offline indexing from online query serving.

The offline path builds and updates distributed indexes.

The online path routes queries to shards, retrieves candidates, merges results, re-ranks them, and returns the final answer or context.


4️⃣ Indexing at Scale


What Is an Index?

An index is a data structure that makes search fast.

Without an index:

Scan all documents
→ Too slow

With an index:

Query
→ Lookup relevant candidates
→ Return results quickly

Index Types

Index Type Best For
Inverted index Keyword search
Vector index Semantic search
Metadata index Filtering
Graph index Relationship retrieval
Column index Structured filters

Important Point

At scale, indexes must be distributed.


👉 Interview Answer

At billion-document scale, retrieval depends on indexes.

Keyword retrieval uses inverted indexes, vector retrieval uses vector indexes, and metadata filtering uses structured indexes.

These indexes must be distributed across many machines.


5️⃣ Sharding


Why Sharding Is Needed

A single machine cannot store or search billions of documents efficiently.

So the index is split across shards.

Global Index
→ Shard 1
→ Shard 2
→ Shard 3
→ ...

Sharding Strategies

Strategy Description
Hash sharding Distribute documents evenly
Range sharding Partition by ID or timestamp
Tenant sharding Partition by customer or organization
Semantic sharding Group similar content
Hybrid sharding Combine multiple strategies

Trade-off

Hash sharding balances load.

Semantic or tenant sharding improves locality.


👉 Interview Answer

Sharding splits the retrieval index across multiple machines.

Hash sharding gives good load balance, while tenant, range, or semantic sharding can improve locality and filtering.

The right strategy depends on query patterns and access-control requirements.


6️⃣ Query Routing


What Is Query Routing?

Query routing decides which shards should receive a query.


Simple Approach

Query all shards
→ Merge results

Good recall, but expensive.


Smarter Approach

Query router
→ Select likely relevant shards
→ Search subset of shards

Lower latency and cost, but may reduce recall.


Routing Signals


👉 Interview Answer

Query routing controls which shards are searched.

Searching all shards gives higher recall but increases latency and cost.

At scale, systems use routing signals like tenant, metadata, region, language, or semantic clusters to search fewer shards.


7️⃣ Distributed Query Execution


Query Fanout

At scale, one query may be sent to many shards.

Query
→ Shard 1
→ Shard 2
→ Shard 3
→ ...
→ Merge results

Problems


Common Controls


👉 Interview Answer

Large-scale retrieval uses distributed query execution.

The system fans out a query to multiple shards, collects top candidates, merges them, handles timeouts, and may return partial results when some shards are slow or unavailable.


8️⃣ Vector Search at Scale


Why Vector Search Is Hard

Exact nearest-neighbor search over billions of vectors is too expensive.


Solution

Use Approximate Nearest Neighbor search.

Exact search
→ Highest recall
→ Too slow

Approximate search
→ Slightly lower recall
→ Much faster

Common ANN Indexes

Index Strength
HNSW Fast high-recall search
IVF Good for large-scale clustering
PQ Compression and lower memory
ScaNN / DiskANN-style systems Large-scale vector retrieval

Trade-off

Higher recall
→ More compute and latency

Lower latency
→ Possible recall loss

👉 Interview Answer

Billion-scale vector retrieval usually uses approximate nearest-neighbor search.

ANN indexes trade a small amount of recall for much lower latency and cost.

The key tuning problem is balancing recall, latency, memory, and index build cost.


9️⃣ Candidate Generation and Re-ranking


Two-Stage Retrieval

Most large systems do not rank everything deeply.

They use two stages.


Stage 1: Candidate Generation

Retrieve many possible matches quickly.

Search index
→ Top 100 or Top 1000 candidates

Stage 2: Re-ranking

Use stronger ranking model on fewer candidates.

Candidates
→ Re-ranker
→ Top 5 or Top 10 final results

Why This Works

Fast retrieval gives recall.

Re-ranking gives precision.


👉 Interview Answer

Large-scale retrieval usually uses a two-stage design.

The first stage retrieves candidates quickly from distributed indexes.

The second stage uses a stronger re-ranker to improve precision on a smaller candidate set.


🔟 Hybrid Retrieval at Scale


Why Hybrid Retrieval Matters

At billion-document scale, neither keyword nor vector search is enough alone.


Hybrid Design

Query
→ Keyword Search
→ Vector Search
→ Metadata Filtering
→ Merge Candidates
→ Re-rank

Benefits


👉 Interview Answer

Production retrieval systems often use hybrid retrieval.

Keyword search handles exact terms, vector search handles semantic meaning, metadata filters enforce constraints, and re-ranking improves final relevance.


1️⃣1️⃣ Metadata Filtering and Access Control


Why It Is Critical

Enterprise retrieval often needs per-user permissions.


Access Control Problem

User query
→ Retrieval system
→ Must only return allowed documents

Filtering Strategies


Best Practice

Filter as early as possible.


👉 Interview Answer

Access control is critical at scale.

The retrieval system must ensure users only retrieve documents they are allowed to see.

In enterprise systems, permission filtering should happen as early as possible in the retrieval path.


1️⃣2️⃣ Freshness and Re-indexing


Documents Constantly Change

Large-scale systems handle:


Index Update Strategies

Strategy Description
Batch indexing Periodic large updates
Streaming indexing Near real-time updates
Delta indexing Only update changed documents
Blue-green index rebuild Build new index, then swap
Lazy refresh Update on demand

Trade-off

Higher freshness
→ More indexing cost

Lower cost
→ More stale results

👉 Interview Answer

Retrieval systems need a freshness strategy.

Some systems use batch indexing, others use streaming or delta indexing.

The trade-off is freshness versus indexing cost and operational complexity.


1️⃣3️⃣ Caching


Why Caching Matters

Popular queries repeat.

Retrieval can be expensive.


Cache Layers


Cache Risks


Important Rule

Cache keys must include permission and context signals.


👉 Interview Answer

Caching reduces retrieval latency and cost, but cache keys must include user permissions, filters, query version, and freshness requirements.

Otherwise, caching can create stale results or security leaks.


1️⃣4️⃣ Replication and Availability


Why Replication Is Needed

Search systems must remain available when machines fail.


Replication Strategy

Shard 1
→ Replica A
→ Replica B
→ Replica C

Benefits


Trade-off

More replicas increase cost.


👉 Interview Answer

Large retrieval systems replicate shards for availability and read scalability.

Replication improves fault tolerance and reduces tail latency, but increases storage and operational cost.


1️⃣5️⃣ Observability


What to Monitor


Debugging Questions


👉 Interview Answer

Observability is essential for large-scale retrieval.

I would track latency, recall, precision, shard health, index freshness, cache hit rate, permission filtering, and ranking quality.

Without observability, retrieval failures are very hard to debug.


1️⃣6️⃣ Best Practices


Practical Rules


Design Principle

Scale retrieval by reducing the search space,
not by scanning everything faster.

👉 Interview Answer

At billion-document scale, the key is reducing the search space.

Good systems use sharding, routing, metadata filtering, approximate search, candidate generation, re-ranking, caching, and replication to balance recall, latency, and cost.


🧠 Staff-Level Answer Final


👉 Interview Answer Full Version

Scaling retrieval to billions of documents is a distributed systems problem.

The main challenges are storage volume, indexing, sharding, query routing, ranking, freshness, access control, latency, and cost.

A typical architecture separates the offline indexing path from the online query path.

The offline path parses documents, cleans them, chunks them, generates embeddings or keyword indexes, builds distributed indexes, and replicates them.

The online path receives a query, routes it to relevant shards, performs keyword or vector search, merges candidate results, applies metadata and permission filters, re-ranks candidates, and returns the final results.

Sharding is required because one machine cannot store or search billions of documents efficiently.

Hash sharding gives good load balance, while tenant, range, metadata, or semantic sharding can improve locality and access control.

Query routing is important because searching every shard is expensive.

The system should use metadata, tenant, language, region, time range, or semantic cluster signals to reduce fanout.

For vector search, exact nearest-neighbor search is too expensive at this scale, so production systems usually use approximate nearest-neighbor indexes.

ANN trades a small amount of recall for much better latency, memory efficiency, and cost.

Most large systems also use two-stage retrieval: first retrieve a larger set of candidates quickly, then use a stronger re-ranker on a smaller set.

Hybrid retrieval is often best for RAG systems because keyword search captures exact terms, vector search captures semantic similarity, metadata filtering enforces constraints, and re-ranking improves precision.

Freshness is another major challenge.

The system needs batch, streaming, delta, or blue-green indexing strategies to handle new, updated, and deleted documents.

Finally, observability is critical.

We need to track shard health, index freshness, query latency, tail latency, cache hit rate, recall, precision, permission filtering, re-ranking latency, and cost per query.

The key design principle is: scale retrieval by reducing the search space, not by scanning everything faster.


⭐ Final Insight

Billion-document retrieval 不是简单地“加更多机器搜索”。

真正的核心是:

Indexing

  • Sharding
  • Query Routing
  • ANN Search
  • Metadata Filtering
  • Candidate Generation
  • Re-ranking
  • Caching
  • Replication
  • Freshness Control。

大规模 retrieval 的关键不是:

“把所有东西都搜一遍”

而是:

尽早减少 search space, 同时保留足够 recall。

最重要的一句话:

Scale retrieval by reducing the search space, not by scanning everything faster.


中文部分


🎯 How Retrieval Systems Scale to Billion Documents


1️⃣ 核心框架

讨论 billion-document scale 的 retrieval systems 时,我通常从这些方面分析:

  1. 为什么 retrieval 在 scale 下变难
  2. Indexing architecture
  3. Sharding and partitioning
  4. Distributed query execution
  5. Vector search at scale
  6. Hybrid search and ranking
  7. Freshness and re-indexing
  8. 核心权衡:recall vs latency vs cost

2️⃣ 为什么 Retrieval 在 Scale 下变难?

搜索几千个 documents 很简单。

搜索 billions of documents 是 distributed systems problem。


Challenges

在 billion-document scale 下, 系统必须处理:


Core Problem

More documents
→ Larger index
→ More shards
→ More network calls
→ Higher latency
→ Harder ranking

👉 面试回答

Billion-document scale 的 retrieval 不只是 search problem。

它是一个 distributed systems problem, 涉及 indexing、sharding、replication、 ranking、caching、freshness、 access control 和 cost control。


3️⃣ High-Level Architecture


Architecture

Documents
→ Ingestion Pipeline
→ Parser / Cleaner
→ Chunker
→ Embedding / Index Builder
→ Distributed Index
→ Query Router
→ Shard Search
→ Result Aggregator
→ Re-ranker
→ Final Results

Two Main Paths

Offline Indexing Path

Documents
→ Parse
→ Chunk
→ Embed
→ Build Index
→ Replicate Index

Online Query Path

Query
→ Query Router
→ Search Relevant Shards
→ Merge Candidates
→ Re-rank
→ Return Results

👉 面试回答

Large-scale retrieval systems 通常把 offline indexing 和 online query serving 分开。

Offline path 负责构建和更新 distributed indexes。

Online path 负责把 query route 到 shards、 retrieve candidates、merge results、 re-rank, 并返回 final answer 或 context。


4️⃣ Indexing at Scale


什么是 Index?

Index 是让 search 变快的数据结构。

Without index:

Scan all documents
→ Too slow

With index:

Query
→ Lookup relevant candidates
→ Return results quickly

Index Types

Index Type Best For
Inverted index Keyword search
Vector index Semantic search
Metadata index Filtering
Graph index Relationship retrieval
Column index Structured filters

Important Point

在 scale 下,indexes 必须是 distributed。


👉 面试回答

Billion-document scale 的 retrieval 依赖 indexes。

Keyword retrieval 使用 inverted indexes, vector retrieval 使用 vector indexes, metadata filtering 使用 structured indexes。

这些 indexes 必须分布在多台机器上。


5️⃣ Sharding


为什么需要 Sharding?

单台机器无法高效存储 或搜索 billions of documents。

所以 index 要拆成 shards。

Global Index
→ Shard 1
→ Shard 2
→ Shard 3
→ ...

Sharding Strategies

Strategy Description
Hash sharding Distribute documents evenly
Range sharding Partition by ID or timestamp
Tenant sharding Partition by customer or organization
Semantic sharding Group similar content
Hybrid sharding Combine multiple strategies

Trade-off

Hash sharding 平衡 load。

Semantic 或 tenant sharding 改善 locality。


👉 面试回答

Sharding 把 retrieval index 拆分到多台 machines 上。

Hash sharding 有良好的 load balance, tenant、range 或 semantic sharding 可以改善 locality 和 filtering。

正确策略取决于 query patterns 和 access-control requirements。


6️⃣ Query Routing


什么是 Query Routing?

Query routing 决定 query 应该发送到哪些 shards。


Simple Approach

Query all shards
→ Merge results

Recall 高, 但成本很高。


Smarter Approach

Query router
→ Select likely relevant shards
→ Search subset of shards

Latency 和 cost 低, 但可能降低 recall。


Routing Signals


👉 面试回答

Query routing 控制哪些 shards 会被搜索。

搜索所有 shards 可以提高 recall, 但会增加 latency 和 cost。

在 scale 下, 系统通常使用 tenant、metadata、 region、language 或 semantic clusters 等 signals 来减少 shard fanout。


7️⃣ Distributed Query Execution


Query Fanout

在 scale 下, 一个 query 可能发送到很多 shards。

Query
→ Shard 1
→ Shard 2
→ Shard 3
→ ...
→ Merge results

Problems


Common Controls


👉 面试回答

Large-scale retrieval 使用 distributed query execution。

系统会把 query fan out 到多个 shards, 收集 top candidates, 合并结果, 处理 timeouts, 并在部分 shards 慢或不可用时 返回 partial results。


8️⃣ Vector Search at Scale


为什么 Vector Search 很难?

在 billions of vectors 上做 exact nearest-neighbor search 成本太高。


Solution

使用 Approximate Nearest Neighbor search。

Exact search
→ Highest recall
→ Too slow

Approximate search
→ Slightly lower recall
→ Much faster

Common ANN Indexes

Index Strength
HNSW Fast high-recall search
IVF Good for large-scale clustering
PQ Compression and lower memory
ScaNN / DiskANN-style systems Large-scale vector retrieval

Trade-off

Higher recall
→ More compute and latency

Lower latency
→ Possible recall loss

👉 面试回答

Billion-scale vector retrieval 通常使用 approximate nearest-neighbor search。

ANN indexes 用少量 recall 损失, 换取更低 latency 和 cost。

关键 tuning 问题是平衡 recall、 latency、memory 和 index build cost。


9️⃣ Candidate Generation and Re-ranking


Two-Stage Retrieval

大规模系统通常不会对所有 documents 深度排序。

它们使用 two stages。


Stage 1: Candidate Generation

快速检索可能相关的 candidates。

Search index
→ Top 100 or Top 1000 candidates

Stage 2: Re-ranking

在较小 candidate set 上使用更强 ranker。

Candidates
→ Re-ranker
→ Top 5 or Top 10 final results

Why This Works

Fast retrieval 给 recall。

Re-ranking 给 precision。


👉 面试回答

Large-scale retrieval 通常使用 two-stage design。

第一阶段从 distributed indexes 快速检索 candidates。

第二阶段在较小 candidate set 上 使用更强的 re-ranker 提升 precision。


🔟 Hybrid Retrieval at Scale


为什么 Hybrid Retrieval 重要?

在 billion-document scale, keyword search 或 vector search 单独使用都不够。


Hybrid Design

Query
→ Keyword Search
→ Vector Search
→ Metadata Filtering
→ Merge Candidates
→ Re-rank

Benefits


👉 面试回答

Production retrieval systems 通常使用 hybrid retrieval。

Keyword search 处理 exact terms, vector search 处理 semantic meaning, metadata filters enforce constraints, re-ranking 提升 final relevance。


1️⃣1️⃣ Metadata Filtering and Access Control


为什么它很关键?

Enterprise retrieval 通常需要 per-user permissions。


Access Control Problem

User query
→ Retrieval system
→ Must only return allowed documents

Filtering Strategies


Best Practice

尽可能早地 filter。


👉 面试回答

Access control 在 scale 下非常关键。

Retrieval system 必须确保用户只能 retrieve 自己有权限查看的 documents。

在 enterprise systems 中, permission filtering 应该尽可能早发生在 retrieval path 中。


1️⃣2️⃣ Freshness and Re-indexing


Documents Constantly Change

大规模系统要处理:


Index Update Strategies

Strategy Description
Batch indexing Periodic large updates
Streaming indexing Near real-time updates
Delta indexing Only update changed documents
Blue-green index rebuild Build new index, then swap
Lazy refresh Update on demand

Trade-off

Higher freshness
→ More indexing cost

Lower cost
→ More stale results

👉 面试回答

Retrieval systems 需要 freshness strategy。

有些系统使用 batch indexing, 有些使用 streaming 或 delta indexing。

核心权衡是 freshness 和 indexing cost / operational complexity。


1️⃣3️⃣ Caching


为什么 Caching 重要?

Popular queries 会重复。

Retrieval 成本高。


Cache Layers


Cache Risks


Important Rule

Cache keys 必须包含 permission 和 context signals。


👉 面试回答

Caching 可以降低 retrieval latency 和 cost, 但 cache keys 必须包含 user permissions、 filters、query version 和 freshness requirements。

否则 caching 可能造成 stale results 或 security leaks。


1️⃣4️⃣ Replication and Availability


为什么需要 Replication?

Search systems 必须在 machines fail 时保持可用。


Replication Strategy

Shard 1
→ Replica A
→ Replica B
→ Replica C

Benefits


Trade-off

更多 replicas 会增加 cost。


👉 面试回答

Large retrieval systems 会 replicate shards, 用于 availability 和 read scalability。

Replication 提高 fault tolerance 并降低 tail latency, 但会增加 storage 和 operational cost。


1️⃣5️⃣ Observability


What to Monitor


Debugging Questions


👉 面试回答

Observability 对 large-scale retrieval 非常重要。

我会追踪 latency、recall、precision、 shard health、index freshness、 cache hit rate、permission filtering 和 ranking quality。

没有 observability, retrieval failures 很难 debug。


1️⃣6️⃣ Best Practices


Practical Rules


Design Principle

Scale retrieval by reducing the search space,
not by scanning everything faster.

👉 面试回答

在 billion-document scale, 关键是减少 search space。

好的系统使用 sharding、routing、 metadata filtering、approximate search、 candidate generation、re-ranking、 caching 和 replication 来平衡 recall、latency 和 cost。


🧠 Staff-Level Answer Final


👉 面试回答完整版本

把 retrieval 扩展到 billions of documents, 本质上是 distributed systems problem。

主要挑战包括 storage volume、indexing、 sharding、query routing、ranking、 freshness、access control、latency 和 cost。

典型 architecture 会把 offline indexing path 和 online query path 分开。

Offline path 负责 parse documents、 clean、chunk、generate embeddings 或 keyword indexes, 构建 distributed indexes, 并复制它们。

Online path 接收 query, route 到 relevant shards, 执行 keyword 或 vector search, merge candidate results, 应用 metadata 和 permission filters, re-rank candidates, 然后返回 final results。

Sharding 是必须的, 因为单台机器无法高效存储 或搜索 billions of documents。

Hash sharding 提供良好的 load balance, tenant、range、metadata 或 semantic sharding 可以改善 locality 和 access control。

Query routing 很重要, 因为搜索每个 shard 成本很高。

系统应该使用 metadata、tenant、 language、region、time range 或 semantic cluster signals 来减少 fanout。

对于 vector search, 在这种 scale 下 exact nearest-neighbor search 成本太高, 所以 production systems 通常使用 approximate nearest-neighbor indexes。

ANN 用少量 recall 损失, 换取更好的 latency、memory efficiency 和 cost。

大型系统通常也使用 two-stage retrieval: 先快速检索较大的 candidate set, 再用更强的 re-ranker 对较小集合排序。

Hybrid retrieval 通常最适合 RAG systems, 因为 keyword search 捕捉 exact terms, vector search 捕捉 semantic similarity, metadata filtering enforce constraints, re-ranking 提升 precision。

Freshness 是另一个 major challenge。

系统需要 batch、streaming、delta 或 blue-green indexing strategies 来处理 new、updated 和 deleted documents。

最后,observability 非常关键。

我们需要追踪 shard health、 index freshness、query latency、 tail latency、cache hit rate、 recall、precision、permission filtering、 re-ranking latency 和 cost per query。

核心设计原则是: scale retrieval by reducing the search space, not by scanning everything faster。


⭐ Final Insight

Billion-document retrieval 不是简单地“加更多机器搜索”。

真正的核心是:

Indexing

  • Sharding
  • Query Routing
  • ANN Search
  • Metadata Filtering
  • Candidate Generation
  • Re-ranking
  • Caching
  • Replication
  • Freshness Control。

大规模 retrieval 的关键不是:

“把所有东西都搜一遍”

而是:

尽早减少 search space, 同时保留足够 recall。

最重要的一句话:

Scale retrieval by reducing the search space, not by scanning everything faster.


Implement