🎯 How Retrieval Systems Scale to Billion Documents
1️⃣ Core Framework
When discussing retrieval systems at billion-document scale, I frame it as:
- Why retrieval becomes hard at scale
- Indexing architecture
- Sharding and partitioning
- Distributed query execution
- Vector search at scale
- Hybrid search and ranking
- Freshness and re-indexing
- Trade-offs: recall vs latency vs cost
2️⃣ Why Retrieval Becomes Hard at Scale
Searching a few thousand documents is easy.
Searching billions of documents is a distributed systems problem.
Challenges
At billion-document scale, the system must handle:
- Huge storage volume
- Large indexes
- High query traffic
- Low latency requirements
- Frequent document updates
- Access control
- Ranking quality
- Cost constraints
Core Problem
More documents
→ Larger index
→ More shards
→ More network calls
→ Higher latency
→ Harder ranking
👉 Interview Answer
Retrieval at billion-document scale is not just a search problem.
It is a distributed systems problem involving indexing, sharding, replication, ranking, caching, freshness, access control, and cost control.
3️⃣ High-Level Architecture
Architecture
Documents
→ Ingestion Pipeline
→ Parser / Cleaner
→ Chunker
→ Embedding / Index Builder
→ Distributed Index
→ Query Router
→ Shard Search
→ Result Aggregator
→ Re-ranker
→ Final Results
Two Main Paths
Offline Indexing Path
Documents
→ Parse
→ Chunk
→ Embed
→ Build Index
→ Replicate Index
Online Query Path
Query
→ Query Router
→ Search Relevant Shards
→ Merge Candidates
→ Re-rank
→ Return Results
👉 Interview Answer
Large-scale retrieval systems usually separate offline indexing from online query serving.
The offline path builds and updates distributed indexes.
The online path routes queries to shards, retrieves candidates, merges results, re-ranks them, and returns the final answer or context.
4️⃣ Indexing at Scale
What Is an Index?
An index is a data structure that makes search fast.
Without an index:
Scan all documents
→ Too slow
With an index:
Query
→ Lookup relevant candidates
→ Return results quickly
Index Types
| Index Type | Best For |
|---|---|
| Inverted index | Keyword search |
| Vector index | Semantic search |
| Metadata index | Filtering |
| Graph index | Relationship retrieval |
| Column index | Structured filters |
Important Point
At scale, indexes must be distributed.
👉 Interview Answer
At billion-document scale, retrieval depends on indexes.
Keyword retrieval uses inverted indexes, vector retrieval uses vector indexes, and metadata filtering uses structured indexes.
These indexes must be distributed across many machines.
5️⃣ Sharding
Why Sharding Is Needed
A single machine cannot store or search billions of documents efficiently.
So the index is split across shards.
Global Index
→ Shard 1
→ Shard 2
→ Shard 3
→ ...
Sharding Strategies
| Strategy | Description |
|---|---|
| Hash sharding | Distribute documents evenly |
| Range sharding | Partition by ID or timestamp |
| Tenant sharding | Partition by customer or organization |
| Semantic sharding | Group similar content |
| Hybrid sharding | Combine multiple strategies |
Trade-off
Hash sharding balances load.
Semantic or tenant sharding improves locality.
👉 Interview Answer
Sharding splits the retrieval index across multiple machines.
Hash sharding gives good load balance, while tenant, range, or semantic sharding can improve locality and filtering.
The right strategy depends on query patterns and access-control requirements.
6️⃣ Query Routing
What Is Query Routing?
Query routing decides which shards should receive a query.
Simple Approach
Query all shards
→ Merge results
Good recall, but expensive.
Smarter Approach
Query router
→ Select likely relevant shards
→ Search subset of shards
Lower latency and cost, but may reduce recall.
Routing Signals
- Tenant ID
- Language
- Region
- Document type
- Time range
- Metadata filters
- Semantic cluster
👉 Interview Answer
Query routing controls which shards are searched.
Searching all shards gives higher recall but increases latency and cost.
At scale, systems use routing signals like tenant, metadata, region, language, or semantic clusters to search fewer shards.
7️⃣ Distributed Query Execution
Query Fanout
At scale, one query may be sent to many shards.
Query
→ Shard 1
→ Shard 2
→ Shard 3
→ ...
→ Merge results
Problems
- Network latency
- Straggler shards
- Partial failures
- Result merging
- Ranking consistency
- Timeout handling
Common Controls
- Parallel shard search
- Per-shard timeout
- Top-k per shard
- Result aggregation
- Partial result fallback
- Tail latency mitigation
👉 Interview Answer
Large-scale retrieval uses distributed query execution.
The system fans out a query to multiple shards, collects top candidates, merges them, handles timeouts, and may return partial results when some shards are slow or unavailable.
8️⃣ Vector Search at Scale
Why Vector Search Is Hard
Exact nearest-neighbor search over billions of vectors is too expensive.
Solution
Use Approximate Nearest Neighbor search.
Exact search
→ Highest recall
→ Too slow
Approximate search
→ Slightly lower recall
→ Much faster
Common ANN Indexes
| Index | Strength |
|---|---|
| HNSW | Fast high-recall search |
| IVF | Good for large-scale clustering |
| PQ | Compression and lower memory |
| ScaNN / DiskANN-style systems | Large-scale vector retrieval |
Trade-off
Higher recall
→ More compute and latency
Lower latency
→ Possible recall loss
👉 Interview Answer
Billion-scale vector retrieval usually uses approximate nearest-neighbor search.
ANN indexes trade a small amount of recall for much lower latency and cost.
The key tuning problem is balancing recall, latency, memory, and index build cost.
9️⃣ Candidate Generation and Re-ranking
Two-Stage Retrieval
Most large systems do not rank everything deeply.
They use two stages.
Stage 1: Candidate Generation
Retrieve many possible matches quickly.
Search index
→ Top 100 or Top 1000 candidates
Stage 2: Re-ranking
Use stronger ranking model on fewer candidates.
Candidates
→ Re-ranker
→ Top 5 or Top 10 final results
Why This Works
Fast retrieval gives recall.
Re-ranking gives precision.
👉 Interview Answer
Large-scale retrieval usually uses a two-stage design.
The first stage retrieves candidates quickly from distributed indexes.
The second stage uses a stronger re-ranker to improve precision on a smaller candidate set.
🔟 Hybrid Retrieval at Scale
Why Hybrid Retrieval Matters
At billion-document scale, neither keyword nor vector search is enough alone.
Hybrid Design
Query
→ Keyword Search
→ Vector Search
→ Metadata Filtering
→ Merge Candidates
→ Re-rank
Benefits
- Exact match coverage
- Semantic recall
- Better ranking quality
- Better enterprise search
- More robust RAG retrieval
👉 Interview Answer
Production retrieval systems often use hybrid retrieval.
Keyword search handles exact terms, vector search handles semantic meaning, metadata filters enforce constraints, and re-ranking improves final relevance.
1️⃣1️⃣ Metadata Filtering and Access Control
Why It Is Critical
Enterprise retrieval often needs per-user permissions.
Access Control Problem
User query
→ Retrieval system
→ Must only return allowed documents
Filtering Strategies
- Pre-filter by permissions
- Post-filter after candidate retrieval
- Tenant-level partitioning
- Access-control metadata index
- Document-level ACLs
Best Practice
Filter as early as possible.
👉 Interview Answer
Access control is critical at scale.
The retrieval system must ensure users only retrieve documents they are allowed to see.
In enterprise systems, permission filtering should happen as early as possible in the retrieval path.
1️⃣2️⃣ Freshness and Re-indexing
Documents Constantly Change
Large-scale systems handle:
- New documents
- Updated documents
- Deleted documents
- Permission changes
- Metadata changes
Index Update Strategies
| Strategy | Description |
|---|---|
| Batch indexing | Periodic large updates |
| Streaming indexing | Near real-time updates |
| Delta indexing | Only update changed documents |
| Blue-green index rebuild | Build new index, then swap |
| Lazy refresh | Update on demand |
Trade-off
Higher freshness
→ More indexing cost
Lower cost
→ More stale results
👉 Interview Answer
Retrieval systems need a freshness strategy.
Some systems use batch indexing, others use streaming or delta indexing.
The trade-off is freshness versus indexing cost and operational complexity.
1️⃣3️⃣ Caching
Why Caching Matters
Popular queries repeat.
Retrieval can be expensive.
Cache Layers
- Query embedding cache
- Retrieval result cache
- Metadata filter cache
- Re-ranker result cache
- Final answer cache
Cache Risks
- Stale results
- Permission leakage
- Incorrect personalization
- Outdated rankings
Important Rule
Cache keys must include permission and context signals.
👉 Interview Answer
Caching reduces retrieval latency and cost, but cache keys must include user permissions, filters, query version, and freshness requirements.
Otherwise, caching can create stale results or security leaks.
1️⃣4️⃣ Replication and Availability
Why Replication Is Needed
Search systems must remain available when machines fail.
Replication Strategy
Shard 1
→ Replica A
→ Replica B
→ Replica C
Benefits
- High availability
- Read scaling
- Failure tolerance
- Lower tail latency
Trade-off
More replicas increase cost.
👉 Interview Answer
Large retrieval systems replicate shards for availability and read scalability.
Replication improves fault tolerance and reduces tail latency, but increases storage and operational cost.
1️⃣5️⃣ Observability
What to Monitor
- Query latency
- Tail latency
- Recall / precision
- Shard health
- Index freshness
- Cache hit rate
- Re-ranking latency
- Failed shard queries
- Permission filtering errors
- Cost per query
Debugging Questions
- Which shards were searched?
- Which candidates were retrieved?
- Which results were filtered?
- Why was a result ranked high?
- Was the index stale?
👉 Interview Answer
Observability is essential for large-scale retrieval.
I would track latency, recall, precision, shard health, index freshness, cache hit rate, permission filtering, and ranking quality.
Without observability, retrieval failures are very hard to debug.
1️⃣6️⃣ Best Practices
Practical Rules
- Separate indexing path from query path
- Shard indexes by query and access patterns
- Use ANN for large-scale vector search
- Use hybrid retrieval for robustness
- Filter permissions early
- Use two-stage retrieval and re-ranking
- Cache carefully with permission-aware keys
- Replicate shards for availability
- Track index freshness
- Monitor recall, latency, and cost
Design Principle
Scale retrieval by reducing the search space,
not by scanning everything faster.
👉 Interview Answer
At billion-document scale, the key is reducing the search space.
Good systems use sharding, routing, metadata filtering, approximate search, candidate generation, re-ranking, caching, and replication to balance recall, latency, and cost.
🧠 Staff-Level Answer Final
👉 Interview Answer Full Version
Scaling retrieval to billions of documents is a distributed systems problem.
The main challenges are storage volume, indexing, sharding, query routing, ranking, freshness, access control, latency, and cost.
A typical architecture separates the offline indexing path from the online query path.
The offline path parses documents, cleans them, chunks them, generates embeddings or keyword indexes, builds distributed indexes, and replicates them.
The online path receives a query, routes it to relevant shards, performs keyword or vector search, merges candidate results, applies metadata and permission filters, re-ranks candidates, and returns the final results.
Sharding is required because one machine cannot store or search billions of documents efficiently.
Hash sharding gives good load balance, while tenant, range, metadata, or semantic sharding can improve locality and access control.
Query routing is important because searching every shard is expensive.
The system should use metadata, tenant, language, region, time range, or semantic cluster signals to reduce fanout.
For vector search, exact nearest-neighbor search is too expensive at this scale, so production systems usually use approximate nearest-neighbor indexes.
ANN trades a small amount of recall for much better latency, memory efficiency, and cost.
Most large systems also use two-stage retrieval: first retrieve a larger set of candidates quickly, then use a stronger re-ranker on a smaller set.
Hybrid retrieval is often best for RAG systems because keyword search captures exact terms, vector search captures semantic similarity, metadata filtering enforces constraints, and re-ranking improves precision.
Freshness is another major challenge.
The system needs batch, streaming, delta, or blue-green indexing strategies to handle new, updated, and deleted documents.
Finally, observability is critical.
We need to track shard health, index freshness, query latency, tail latency, cache hit rate, recall, precision, permission filtering, re-ranking latency, and cost per query.
The key design principle is: scale retrieval by reducing the search space, not by scanning everything faster.
⭐ Final Insight
Billion-document retrieval 不是简单地“加更多机器搜索”。
真正的核心是:
Indexing
- Sharding
- Query Routing
- ANN Search
- Metadata Filtering
- Candidate Generation
- Re-ranking
- Caching
- Replication
- Freshness Control。
大规模 retrieval 的关键不是:
“把所有东西都搜一遍”
而是:
尽早减少 search space, 同时保留足够 recall。
最重要的一句话:
Scale retrieval by reducing the search space, not by scanning everything faster.
中文部分
🎯 How Retrieval Systems Scale to Billion Documents
1️⃣ 核心框架
讨论 billion-document scale 的 retrieval systems 时,我通常从这些方面分析:
- 为什么 retrieval 在 scale 下变难
- Indexing architecture
- Sharding and partitioning
- Distributed query execution
- Vector search at scale
- Hybrid search and ranking
- Freshness and re-indexing
- 核心权衡:recall vs latency vs cost
2️⃣ 为什么 Retrieval 在 Scale 下变难?
搜索几千个 documents 很简单。
搜索 billions of documents 是 distributed systems problem。
Challenges
在 billion-document scale 下, 系统必须处理:
- Huge storage volume
- Large indexes
- High query traffic
- Low latency requirements
- Frequent document updates
- Access control
- Ranking quality
- Cost constraints
Core Problem
More documents
→ Larger index
→ More shards
→ More network calls
→ Higher latency
→ Harder ranking
👉 面试回答
Billion-document scale 的 retrieval 不只是 search problem。
它是一个 distributed systems problem, 涉及 indexing、sharding、replication、 ranking、caching、freshness、 access control 和 cost control。
3️⃣ High-Level Architecture
Architecture
Documents
→ Ingestion Pipeline
→ Parser / Cleaner
→ Chunker
→ Embedding / Index Builder
→ Distributed Index
→ Query Router
→ Shard Search
→ Result Aggregator
→ Re-ranker
→ Final Results
Two Main Paths
Offline Indexing Path
Documents
→ Parse
→ Chunk
→ Embed
→ Build Index
→ Replicate Index
Online Query Path
Query
→ Query Router
→ Search Relevant Shards
→ Merge Candidates
→ Re-rank
→ Return Results
👉 面试回答
Large-scale retrieval systems 通常把 offline indexing 和 online query serving 分开。
Offline path 负责构建和更新 distributed indexes。
Online path 负责把 query route 到 shards、 retrieve candidates、merge results、 re-rank, 并返回 final answer 或 context。
4️⃣ Indexing at Scale
什么是 Index?
Index 是让 search 变快的数据结构。
Without index:
Scan all documents
→ Too slow
With index:
Query
→ Lookup relevant candidates
→ Return results quickly
Index Types
| Index Type | Best For |
|---|---|
| Inverted index | Keyword search |
| Vector index | Semantic search |
| Metadata index | Filtering |
| Graph index | Relationship retrieval |
| Column index | Structured filters |
Important Point
在 scale 下,indexes 必须是 distributed。
👉 面试回答
Billion-document scale 的 retrieval 依赖 indexes。
Keyword retrieval 使用 inverted indexes, vector retrieval 使用 vector indexes, metadata filtering 使用 structured indexes。
这些 indexes 必须分布在多台机器上。
5️⃣ Sharding
为什么需要 Sharding?
单台机器无法高效存储 或搜索 billions of documents。
所以 index 要拆成 shards。
Global Index
→ Shard 1
→ Shard 2
→ Shard 3
→ ...
Sharding Strategies
| Strategy | Description |
|---|---|
| Hash sharding | Distribute documents evenly |
| Range sharding | Partition by ID or timestamp |
| Tenant sharding | Partition by customer or organization |
| Semantic sharding | Group similar content |
| Hybrid sharding | Combine multiple strategies |
Trade-off
Hash sharding 平衡 load。
Semantic 或 tenant sharding 改善 locality。
👉 面试回答
Sharding 把 retrieval index 拆分到多台 machines 上。
Hash sharding 有良好的 load balance, tenant、range 或 semantic sharding 可以改善 locality 和 filtering。
正确策略取决于 query patterns 和 access-control requirements。
6️⃣ Query Routing
什么是 Query Routing?
Query routing 决定 query 应该发送到哪些 shards。
Simple Approach
Query all shards
→ Merge results
Recall 高, 但成本很高。
Smarter Approach
Query router
→ Select likely relevant shards
→ Search subset of shards
Latency 和 cost 低, 但可能降低 recall。
Routing Signals
- Tenant ID
- Language
- Region
- Document type
- Time range
- Metadata filters
- Semantic cluster
👉 面试回答
Query routing 控制哪些 shards 会被搜索。
搜索所有 shards 可以提高 recall, 但会增加 latency 和 cost。
在 scale 下, 系统通常使用 tenant、metadata、 region、language 或 semantic clusters 等 signals 来减少 shard fanout。
7️⃣ Distributed Query Execution
Query Fanout
在 scale 下, 一个 query 可能发送到很多 shards。
Query
→ Shard 1
→ Shard 2
→ Shard 3
→ ...
→ Merge results
Problems
- Network latency
- Straggler shards
- Partial failures
- Result merging
- Ranking consistency
- Timeout handling
Common Controls
- Parallel shard search
- Per-shard timeout
- Top-k per shard
- Result aggregation
- Partial result fallback
- Tail latency mitigation
👉 面试回答
Large-scale retrieval 使用 distributed query execution。
系统会把 query fan out 到多个 shards, 收集 top candidates, 合并结果, 处理 timeouts, 并在部分 shards 慢或不可用时 返回 partial results。
8️⃣ Vector Search at Scale
为什么 Vector Search 很难?
在 billions of vectors 上做 exact nearest-neighbor search 成本太高。
Solution
使用 Approximate Nearest Neighbor search。
Exact search
→ Highest recall
→ Too slow
Approximate search
→ Slightly lower recall
→ Much faster
Common ANN Indexes
| Index | Strength |
|---|---|
| HNSW | Fast high-recall search |
| IVF | Good for large-scale clustering |
| PQ | Compression and lower memory |
| ScaNN / DiskANN-style systems | Large-scale vector retrieval |
Trade-off
Higher recall
→ More compute and latency
Lower latency
→ Possible recall loss
👉 面试回答
Billion-scale vector retrieval 通常使用 approximate nearest-neighbor search。
ANN indexes 用少量 recall 损失, 换取更低 latency 和 cost。
关键 tuning 问题是平衡 recall、 latency、memory 和 index build cost。
9️⃣ Candidate Generation and Re-ranking
Two-Stage Retrieval
大规模系统通常不会对所有 documents 深度排序。
它们使用 two stages。
Stage 1: Candidate Generation
快速检索可能相关的 candidates。
Search index
→ Top 100 or Top 1000 candidates
Stage 2: Re-ranking
在较小 candidate set 上使用更强 ranker。
Candidates
→ Re-ranker
→ Top 5 or Top 10 final results
Why This Works
Fast retrieval 给 recall。
Re-ranking 给 precision。
👉 面试回答
Large-scale retrieval 通常使用 two-stage design。
第一阶段从 distributed indexes 快速检索 candidates。
第二阶段在较小 candidate set 上 使用更强的 re-ranker 提升 precision。
🔟 Hybrid Retrieval at Scale
为什么 Hybrid Retrieval 重要?
在 billion-document scale, keyword search 或 vector search 单独使用都不够。
Hybrid Design
Query
→ Keyword Search
→ Vector Search
→ Metadata Filtering
→ Merge Candidates
→ Re-rank
Benefits
- Exact match coverage
- Semantic recall
- Better ranking quality
- Better enterprise search
- More robust RAG retrieval
👉 面试回答
Production retrieval systems 通常使用 hybrid retrieval。
Keyword search 处理 exact terms, vector search 处理 semantic meaning, metadata filters enforce constraints, re-ranking 提升 final relevance。
1️⃣1️⃣ Metadata Filtering and Access Control
为什么它很关键?
Enterprise retrieval 通常需要 per-user permissions。
Access Control Problem
User query
→ Retrieval system
→ Must only return allowed documents
Filtering Strategies
- Pre-filter by permissions
- Post-filter after candidate retrieval
- Tenant-level partitioning
- Access-control metadata index
- Document-level ACLs
Best Practice
尽可能早地 filter。
👉 面试回答
Access control 在 scale 下非常关键。
Retrieval system 必须确保用户只能 retrieve 自己有权限查看的 documents。
在 enterprise systems 中, permission filtering 应该尽可能早发生在 retrieval path 中。
1️⃣2️⃣ Freshness and Re-indexing
Documents Constantly Change
大规模系统要处理:
- New documents
- Updated documents
- Deleted documents
- Permission changes
- Metadata changes
Index Update Strategies
| Strategy | Description |
|---|---|
| Batch indexing | Periodic large updates |
| Streaming indexing | Near real-time updates |
| Delta indexing | Only update changed documents |
| Blue-green index rebuild | Build new index, then swap |
| Lazy refresh | Update on demand |
Trade-off
Higher freshness
→ More indexing cost
Lower cost
→ More stale results
👉 面试回答
Retrieval systems 需要 freshness strategy。
有些系统使用 batch indexing, 有些使用 streaming 或 delta indexing。
核心权衡是 freshness 和 indexing cost / operational complexity。
1️⃣3️⃣ Caching
为什么 Caching 重要?
Popular queries 会重复。
Retrieval 成本高。
Cache Layers
- Query embedding cache
- Retrieval result cache
- Metadata filter cache
- Re-ranker result cache
- Final answer cache
Cache Risks
- Stale results
- Permission leakage
- Incorrect personalization
- Outdated rankings
Important Rule
Cache keys 必须包含 permission 和 context signals。
👉 面试回答
Caching 可以降低 retrieval latency 和 cost, 但 cache keys 必须包含 user permissions、 filters、query version 和 freshness requirements。
否则 caching 可能造成 stale results 或 security leaks。
1️⃣4️⃣ Replication and Availability
为什么需要 Replication?
Search systems 必须在 machines fail 时保持可用。
Replication Strategy
Shard 1
→ Replica A
→ Replica B
→ Replica C
Benefits
- High availability
- Read scaling
- Failure tolerance
- Lower tail latency
Trade-off
更多 replicas 会增加 cost。
👉 面试回答
Large retrieval systems 会 replicate shards, 用于 availability 和 read scalability。
Replication 提高 fault tolerance 并降低 tail latency, 但会增加 storage 和 operational cost。
1️⃣5️⃣ Observability
What to Monitor
- Query latency
- Tail latency
- Recall / precision
- Shard health
- Index freshness
- Cache hit rate
- Re-ranking latency
- Failed shard queries
- Permission filtering errors
- Cost per query
Debugging Questions
- 哪些 shards 被 searched?
- 哪些 candidates 被 retrieved?
- 哪些 results 被 filtered?
- 为什么某个 result rank 很高?
- Index 是否 stale?
👉 面试回答
Observability 对 large-scale retrieval 非常重要。
我会追踪 latency、recall、precision、 shard health、index freshness、 cache hit rate、permission filtering 和 ranking quality。
没有 observability, retrieval failures 很难 debug。
1️⃣6️⃣ Best Practices
Practical Rules
- Separate indexing path from query path
- Shard indexes by query and access patterns
- Use ANN for large-scale vector search
- Use hybrid retrieval for robustness
- Filter permissions early
- Use two-stage retrieval and re-ranking
- Cache carefully with permission-aware keys
- Replicate shards for availability
- Track index freshness
- Monitor recall, latency, and cost
Design Principle
Scale retrieval by reducing the search space,
not by scanning everything faster.
👉 面试回答
在 billion-document scale, 关键是减少 search space。
好的系统使用 sharding、routing、 metadata filtering、approximate search、 candidate generation、re-ranking、 caching 和 replication 来平衡 recall、latency 和 cost。
🧠 Staff-Level Answer Final
👉 面试回答完整版本
把 retrieval 扩展到 billions of documents, 本质上是 distributed systems problem。
主要挑战包括 storage volume、indexing、 sharding、query routing、ranking、 freshness、access control、latency 和 cost。
典型 architecture 会把 offline indexing path 和 online query path 分开。
Offline path 负责 parse documents、 clean、chunk、generate embeddings 或 keyword indexes, 构建 distributed indexes, 并复制它们。
Online path 接收 query, route 到 relevant shards, 执行 keyword 或 vector search, merge candidate results, 应用 metadata 和 permission filters, re-rank candidates, 然后返回 final results。
Sharding 是必须的, 因为单台机器无法高效存储 或搜索 billions of documents。
Hash sharding 提供良好的 load balance, tenant、range、metadata 或 semantic sharding 可以改善 locality 和 access control。
Query routing 很重要, 因为搜索每个 shard 成本很高。
系统应该使用 metadata、tenant、 language、region、time range 或 semantic cluster signals 来减少 fanout。
对于 vector search, 在这种 scale 下 exact nearest-neighbor search 成本太高, 所以 production systems 通常使用 approximate nearest-neighbor indexes。
ANN 用少量 recall 损失, 换取更好的 latency、memory efficiency 和 cost。
大型系统通常也使用 two-stage retrieval: 先快速检索较大的 candidate set, 再用更强的 re-ranker 对较小集合排序。
Hybrid retrieval 通常最适合 RAG systems, 因为 keyword search 捕捉 exact terms, vector search 捕捉 semantic similarity, metadata filtering enforce constraints, re-ranking 提升 precision。
Freshness 是另一个 major challenge。
系统需要 batch、streaming、delta 或 blue-green indexing strategies 来处理 new、updated 和 deleted documents。
最后,observability 非常关键。
我们需要追踪 shard health、 index freshness、query latency、 tail latency、cache hit rate、 recall、precision、permission filtering、 re-ranking latency 和 cost per query。
核心设计原则是: scale retrieval by reducing the search space, not by scanning everything faster。
⭐ Final Insight
Billion-document retrieval 不是简单地“加更多机器搜索”。
真正的核心是:
Indexing
- Sharding
- Query Routing
- ANN Search
- Metadata Filtering
- Candidate Generation
- Re-ranking
- Caching
- Replication
- Freshness Control。
大规模 retrieval 的关键不是:
“把所有东西都搜一遍”
而是:
尽早减少 search space, 同时保留足够 recall。
最重要的一句话:
Scale retrieval by reducing the search space, not by scanning everything faster.
Implement