System Design Deep Dive - 03 Embeddings & Vector DB

Post by ailswan May. 26, 2026

Chinese version ↓

🎯 Embeddings & Vector Database


1️⃣ Core Framework

When discussing Embeddings & Vector DB, I frame it as:

  1. What embeddings are
  2. How embeddings are generated
  3. Vector similarity search
  4. Indexing and storage
  5. Query pipeline
  6. Scaling and optimization
  7. Trade-offs: accuracy vs latency vs cost

2️⃣ What Are Embeddings?


Definition

An embedding is a numerical vector representation of text.

"refund policy" → [0.12, -0.44, 0.89, ...]

Key Idea

Semantically similar text → similar vectors

"refund policy"
"return money rules"
→ vectors close in space

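Why Are They Important?

Embeddings enable:

  1. Semantic search: retrieving documents by meaning rather than exact keywords
  2. Retrieval for RAG systems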

👉 Interview Answer

Embeddings convert text into vectors that capture semantic meaning.

This allows us to perform similarity search, where queries retrieve documents based on meaning rather than exact keywords.


3️⃣ Embedding Models


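Input

Text (a user query or a document chunk)

Output

A fixed-length vector of floats

Example

Input: "How to refund an order?"
Output: [0.23, -0.12, ..., 0.91]

Properties

  1. Fixed length: every input maps to a vector with the same number of dimensions
  2. Semantically similar text maps to nearby vectors
  3. Documents and queries must be embedded with the same model; vectors from different models are not comparable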

👉 Interview Answer

Embedding models take text input and output fixed-length vectors.

These vectors are used to measure similarity between queries and documents.
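To make the input/output contract concrete, here is a toy stand-in for an embedding model in Python. It is an illustration only: `toy_embed` and `DIM` are hypothetical names, it uses the hashing trick instead of a trained model, and its vectors carry no semantic meaning. What it does show is the contract itself: any text in, a fixed-length, normalized vector out.

```python
import hashlib
import math

DIM = 8  # hypothetical: real embedding models output hundreds to thousands of dimensions


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hashing-trick embedding (illustrative only, NOT semantic).

    Each whitespace token adds +1 or -1 to one of `dim` buckets, chosen by a
    stable hash. The result is L2-normalized, so any input, short or long,
    maps to a vector of exactly `dim` floats.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = digest[0] % dim                    # which dimension this token hits
        sign = 1.0 if digest[1] % 2 == 0 else -1.0  # stable pseudo-random sign
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


short = toy_embed("How to refund an order?")
long_vec = toy_embed("Refunds are processed within 5 days after the returned item arrives")
print(len(short), len(long_vec))  # 8 8
```

A real model replaces the hashing with a trained network, but callers see the same shape: the vector length depends on the model, never on the input.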


4️⃣ Similarity Search


Goal

Find the stored vectors most similar to the query vector.


Common Metrics

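Cosine Similarity

similarity = cos θ = (v1 · v2) / (||v1|| ||v2||)

Dot Product

similarity = v1 · v2   (equal to cosine similarity when both vectors are normalized to unit length)

Euclidean Distance

distance = ||v1 - v2||   (lower means more similar)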

Example

Query: "refund policy"
→ embedding
→ find nearest vectors in DB
→ return top-K results

👉 Interview Answer

Vector search finds nearest neighbors in embedding space.

Cosine similarity is commonly used because it measures directional similarity, which works well for text embeddings.
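The three metrics can be compared on two vectors that point the same way but differ in length. A minimal Python sketch; the function names are illustrative, not from any library:

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos θ = (a · b) / (||a|| ||b||): depends only on direction
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    # ||a - b||: lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a: list[float]) -> list[float]:
    n = l2_norm(a)
    return [x / n for x in a]


q = [3.0, 4.0]
d = [6.0, 8.0]  # same direction as q, twice as long

print(cosine_similarity(q, d))   # 1.0  (identical direction)
print(dot(q, d))                 # 50.0 (grows with vector length)
print(euclidean_distance(q, d))  # 5.0  (nonzero despite identical direction)
```

After `normalize`, the dot product equals cosine similarity up to floating-point rounding, which is why many systems normalize vectors once at write time and use the cheaper dot product at query time.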


5️⃣ Vector Database


What Is a Vector DB?

A database optimized for storing and searching vectors.


Stores

embedding vector
+ original text
+ metadata

Example Record

{
  "embedding": [...],
  "text": "Refunds are processed within 5 days",
  "metadata": {
    "doc_id": "123",
    "section": "refund",
    "source": "help_center"
  }
}

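Key Features

  1. Fast approximate nearest-neighbor (ANN) search
  2. Metadata storage and filtering
  3. Inserts, updates, and deletes as the underlying data changes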

👉 Interview Answer

A vector database stores embeddings and supports fast similarity search.

It is the core component of RAG systems, enabling retrieval of relevant document chunks at query time.
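A minimal sketch of the storage model described above, assuming hand-made 3-dimensional vectors and a hypothetical `InMemoryVectorStore` class. Each record carries the embedding, the original text, and metadata, like the example record. The search is exact (flat): it scores every record, which is the linear scan that ANN indexing is meant to replace at scale.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    embedding: list[float]
    text: str
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InMemoryVectorStore:
    """Hypothetical minimal vector store: exact (flat) search over every record."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, record: Record) -> None:
        self.records.append(record)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, Record]]:
        # Score every record, then keep the k highest: O(N · d) per query.
        scored = ((cosine(query, r.embedding), r) for r in self.records)
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = InMemoryVectorStore()
# Hand-made vectors; a real embedding model would produce them.
store.add(Record([0.9, 0.1, 0.0], "Refunds are processed within 5 days",
                 {"doc_id": "123", "section": "refund", "source": "help_center"}))
store.add(Record([0.1, 0.9, 0.0], "Orders ship within 2 business days",
                 {"doc_id": "456", "section": "shipping", "source": "help_center"}))
store.add(Record([0.0, 0.2, 0.9], "Reset your password from the settings page",
                 {"doc_id": "789", "section": "account", "source": "help_center"}))

for score, rec in store.search([1.0, 0.0, 0.0], k=2):
    print(f"{score:.3f}  {rec.metadata['doc_id']}  {rec.text}")
```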


6️⃣ Indexing Strategies


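Problem

Exact search (a linear scan) compares the query with every stored vector, which is too slow at scale:

O(N · d) per query, for N vectors of dimension d

Solution: ANN (Approximate Nearest Neighbor) search, which gives up a small amount of recall for much faster queries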


Common Index Types

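HNSW (Hierarchical Navigable Small World)

A multi-layer proximity graph: search moves greedily toward the query, from sparse upper layers down to the dense bottom layer.

IVF (Inverted File Index)

Clusters vectors around centroids (e.g., with k-means); a query scans only the few clusters whose centroids are closest.

Flat Index

Brute-force exact search over every vector; the accuracy baseline.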

Trade-off

Index   Speed   Accuracy                   Memory
Flat    Slow    Exact                      Baseline (raw vectors only)
HNSW    Fast    High                       High (vectors + graph links)
IVF     Fast    Medium (tunable)           Baseline, or lower with vector compression

👉 Interview Answer

At scale, vector databases use approximate nearest neighbor indexing to reduce search complexity from linear to sublinear time.

HNSW is commonly used because it provides a good balance between speed and recall.
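To see where IVF's speed/accuracy trade-off comes from, here is a small pure-Python sketch. Assumptions: Euclidean distance, a simple k-means for the centroids, and random (unclustered) data; `IVFIndex`, `n_lists`, and `nprobe` are illustrative names. Scanning more clusters (`nprobe`) raises recall toward the exact result at the cost of comparing more vectors.

```python
import random


def dist2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance (same ranking as Euclidean, no sqrt)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v: list[float], points: list[list[float]]) -> int:
    return min(range(len(points)), key=lambda j: dist2(v, points[j]))


def kmeans(vectors, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


class IVFIndex:
    """Sketch of an inverted-file index: one list of vector ids per centroid."""

    def __init__(self, vectors, n_lists=8, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, n_lists, seed=seed)
        self.lists = [[] for _ in range(n_lists)]
        for i, v in enumerate(vectors):
            self.lists[nearest(v, self.centroids)].append(i)

    def search(self, query, k=5, nprobe=1):
        # Scan only the nprobe lists whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)),
                       key=lambda j: dist2(query, self.centroids[j]))
        candidates = [i for j in order[:nprobe] for i in self.lists[j]]
        candidates.sort(key=lambda i: dist2(query, self.vectors[i]))
        return candidates[:k]


def flat_search(vectors, query, k=5):
    """Exact baseline: rank every vector."""
    return sorted(range(len(vectors)), key=lambda i: dist2(query, vectors[i]))[:k]


rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query = [rng.gauss(0, 1) for _ in range(8)]

index = IVFIndex(data, n_lists=8)
exact = set(flat_search(data, query, k=10))
for nprobe in (1, 2, 8):
    approx = set(index.search(query, k=10, nprobe=nprobe))
    print(f"nprobe={nprobe}: recall@10 = {len(approx & exact) / 10:.1f}")
```

With `nprobe` equal to the number of lists, IVF scans everything and matches flat search. Random Gaussian data has little cluster structure, so low-`nprobe` recall here is likely pessimistic compared with real embeddings; production libraries implement IVF and HNSW in optimized native code.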


7️⃣ Query Pipeline


Flow

User query
→ Convert to embedding
→ Search vector DB
→ Retrieve top-K results
→ (Optional) re-rank
→ Return results

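With Metadata Filtering

WHERE source = 'help_center'
AND region = 'US'

Hybrid Search

Combine semantic (vector) similarity with keyword search (e.g., BM25), so exact terms such as product names or error codes still match.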

👉 Interview Answer

The query pipeline converts the user query into an embedding, retrieves the most similar vectors, optionally applies filters and re-ranking, and returns relevant results.
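The query pipeline with a metadata filter and hybrid scoring can be sketched as follows. Assumptions: records are plain dicts shaped like the example record, `keyword_score` is a toy word-overlap signal standing in for a real keyword engine such as BM25, and the weighted sum controlled by `alpha` is just one way to fuse the two scores (rank-based fusion is a common alternative). All names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query: str, text: str) -> float:
    """Toy keyword signal: fraction of query words that appear in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


def hybrid_search(records, query_text, query_vec, filters, k=3, alpha=0.7):
    # 1) Metadata filter first, like WHERE source = 'help_center' AND region = 'US'.
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    # 2) Hybrid score: weighted sum of vector similarity and keyword overlap.
    scored = [
        (alpha * cosine(query_vec, r["embedding"])
         + (1 - alpha) * keyword_score(query_text, r["text"]), r)
        for r in candidates
    ]
    # 3) Top-K by combined score (an optional re-ranking step could follow).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


records = [
    {"embedding": [0.9, 0.1, 0.0], "text": "Refunds follow our refund policy within 5 days",
     "metadata": {"source": "help_center", "region": "US"}},
    {"embedding": [0.95, 0.05, 0.0], "text": "EU refund policy for returned items",
     "metadata": {"source": "help_center", "region": "EU"}},
    {"embedding": [0.2, 0.9, 0.0], "text": "Shipping policy and delivery times",
     "metadata": {"source": "help_center", "region": "US"}},
    {"embedding": [0.9, 0.1, 0.0], "text": "Blog post about our refund policy",
     "metadata": {"source": "blog", "region": "US"}},
]

results = hybrid_search(records, "refund policy", [1.0, 0.0, 0.0],
                        filters={"source": "help_center", "region": "US"}, k=2)
for score, r in results:
    print(f"{score:.2f}  {r['text']}")
```

Filtering before scoring guarantees every result satisfies the filter; the EU and blog records never reach the scoring step even though their vectors are close to the query.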


8️⃣ Chunking Strategy


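Why Needed?

Embedding models have a maximum input length, and a single vector for a long document blurs many topics together, so documents are split into smaller chunks before embedding.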


Approach

Document → chunks (200–500 tokens)

Trade-offs

Chunk Size   Pros                Cons
Small        Precise retrieval   Lose context
Large        More context        More noise

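Best Practice

Tune chunk size against real queries to balance context and precision; 200–500 tokens is a reasonable starting range.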

👉 Interview Answer

Chunking is critical because embeddings operate on chunks.

Good chunking improves retrieval quality by balancing context and precision.
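A minimal fixed-size chunker, sketched in Python. Assumptions: whitespace-separated words stand in for model tokens, and neighboring chunks overlap by 50 tokens so text near a boundary appears in both chunks (overlap is an assumption here, not something this post prescribes). The default chunk size of 300 sits inside the 200–500 range above; `chunk_tokens` and `chunk_text` are hypothetical names.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 300, overlap: int = 50) -> list[list[str]]:
    """Fixed-size chunks, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    # Whitespace words approximate model tokens here.
    return [" ".join(c) for c in chunk_tokens(text.split(), chunk_size, overlap)]


words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])  # [300, 300, 300, 250]
```

Real pipelines usually count tokens with the embedding model's own tokenizer and often prefer splitting at paragraph or section boundaries; the fixed-window version is the simplest baseline.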


9️⃣ Re-ranking


Problem

Top-K retrieval may include noise.


Solution

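Re-rank results using a slower but more precise relevance model, for example a cross-encoder that scores each (query, document) pair jointly instead of comparing precomputed vectors.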


Flow

Top 50 retrieved
→ re-rank
→ top 5 used for context

👉 Interview Answer

Re-ranking improves precision after retrieval.

It helps select the most relevant documents before passing them into the LLM.
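The two-stage flow (retrieve 50, keep 5) looks like this as a Python sketch. `toy_reranker` is a hypothetical placeholder for a real re-ranking model; only the structure matters here: a cheap score narrows the candidates, and a more expensive score orders the shortlist.

```python
import heapq
from typing import Callable


def retrieve_then_rerank(
    query: str,
    scored_docs: list[tuple[float, str]],
    rerank_fn: Callable[[str, str], float],
    retrieve_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """scored_docs holds (vector_score, text) pairs from the first-stage search."""
    # Stage 1: cheap, recall-oriented cut using the vector scores.
    shortlist = heapq.nlargest(retrieve_k, scored_docs, key=lambda d: d[0])
    # Stage 2: expensive, precision-oriented scoring, run only on the shortlist.
    reranked = sorted(shortlist, key=lambda d: rerank_fn(query, d[1]), reverse=True)
    return [text for _, text in reranked[:final_k]]


def toy_reranker(query: str, text: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


docs = [
    (0.95, "track your order status"),
    (0.90, "refund an order within 30 days"),
    (0.10, "careers at our company"),
]
print(retrieve_then_rerank("refund an order", docs, toy_reranker, retrieve_k=2, final_k=1))
# ['refund an order within 30 days']
```

The re-ranker flips the vector search's top result, which is the point: the expensive model only runs on `retrieve_k` candidates, so its cost stays bounded no matter how large the corpus is.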


🔟 Scaling Challenges


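Challenges

  1. Large datasets
  2. High query throughput under low-latency requirements
  3. Frequent updates

Solutions

  1. Sharding data across nodes
  2. Approximate (ANN) search
  3. Caching popular queries
  4. Efficient indexing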

Sharding Example

shard_id = hash(doc_id) % N

👉 Interview Answer

At scale, vector databases need to distribute data across shards, use approximate search, and cache popular queries.

This ensures low latency and high throughput.
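A Python sketch of the `hash(doc_id) % N` sharding scheme plus scatter-gather querying. Two assumptions to flag: `zlib.crc32` replaces Python's built-in `hash()`, because the latter is salted per process for strings, and each shard is an in-memory dict standing in for a separate node. `ShardedIndex` is a hypothetical name.

```python
import heapq
import zlib

N_SHARDS = 4


def shard_id(doc_id: str, n_shards: int = N_SHARDS) -> int:
    # Stable hash. Python's built-in hash() of a str is salted per process,
    # so it could route the same doc_id to a different shard after a restart.
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards


class ShardedIndex:
    """Each shard is a dict doc_id -> vector; in production, a separate node."""

    def __init__(self, n_shards: int = N_SHARDS) -> None:
        self.shards: list[dict[str, list[float]]] = [{} for _ in range(n_shards)]

    def add(self, doc_id: str, vector: list[float]) -> None:
        self.shards[shard_id(doc_id, len(self.shards))][doc_id] = vector

    @staticmethod
    def _local_top_k(shard, query, k):
        # Dot-product scoring (equals cosine if vectors are normalized upstream).
        scores = ((sum(q * v for q, v in zip(query, vec)), doc_id)
                  for doc_id, vec in shard.items())
        return heapq.nlargest(k, scores)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        # Scatter: every shard returns its local top-k.
        # Gather: merge the partial lists into the global top-k.
        partial = []
        for shard in self.shards:
            partial.extend(self._local_top_k(shard, query, k))
        return heapq.nlargest(k, partial)


index = ShardedIndex()
for i in range(20):
    index.add(f"doc-{i}", [float(i), 1.0])
print(index.search([1.0, 0.0], k=3))
# [(19.0, 'doc-19'), (18.0, 'doc-18'), (17.0, 'doc-17')]
```

With exact per-shard search, merging local top-K lists yields the exact global top-K, since any global top-K item is also in its own shard's top-K. Note that `% N` remaps most documents when N changes; consistent hashing is the usual way to limit that movement.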


1️⃣1️⃣ Freshness and Updates


Problem

Data changes over time.


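Options

  1. Batch updates: periodically re-embed and re-index changed documents
  2. Streaming updates: embed and upsert each change as it happens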

Trade-off

Approach    Freshness   Cost
Batch       Low         Low
Streaming   High        High

👉 Interview Answer

Keeping embeddings fresh matters: stale vectors make retrieval return outdated content.

Systems can use batch updates for efficiency or streaming updates for real-time freshness, depending on requirements.
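A sketch of both update paths in Python, assuming each document has a stable `doc_id` and that a content hash is an acceptable change detector. `IncrementalIndexer` and its methods are hypothetical: `upsert` is the streaming path (one call per change event) and `sync_batch` is the batch path (periodic reconciliation against a full snapshot). Both skip re-embedding unchanged text, which saves embedding calls.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalIndexer:
    """Skips re-embedding documents whose content has not changed."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self.embed_fn = embed_fn
        self.store: dict[str, tuple[str, list[float]]] = {}  # doc_id -> (hash, vector)
        self.embed_calls = 0

    def upsert(self, doc_id: str, text: str) -> bool:
        """Streaming path: call once per change event. Returns True if re-embedded."""
        h = content_hash(text)
        existing = self.store.get(doc_id)
        if existing is not None and existing[0] == h:
            return False  # unchanged content: skip the embedding call
        self.embed_calls += 1
        self.store[doc_id] = (h, self.embed_fn(text))
        return True

    def delete(self, doc_id: str) -> None:
        self.store.pop(doc_id, None)

    def sync_batch(self, snapshot: dict[str, str]) -> int:
        """Batch path: reconcile against a full snapshot {doc_id: text}."""
        for doc_id in list(self.store):
            if doc_id not in snapshot:
                self.delete(doc_id)  # document was removed at the source
        return sum(self.upsert(doc_id, text) for doc_id, text in snapshot.items())


indexer = IncrementalIndexer(lambda text: [float(len(text))])  # toy embed_fn
snapshot = {"a": "Refunds take 5 days", "b": "Orders ship in 2 days"}
print(indexer.sync_batch(snapshot))                # 2
print(indexer.sync_batch(snapshot))                # 0
print(indexer.upsert("a", "Refunds take 7 days"))  # True
print(indexer.embed_calls)                         # 3
```

One case this does not cover: switching to a different embedding model changes the vector space, so every document must be re-embedded.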


1️⃣2️⃣ Cost Optimization


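Cost Drivers

  1. Embedding generation (model calls for every document chunk and every query)
  2. Vector storage and memory (grows with vector count × dimensions)
  3. Search compute at query time

Optimizations

  1. Caching embeddings and popular queries
  2. Reducing embedding dimensions
  3. Limiting retrieval size (top-K)
  4. Using approximate search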

👉 Interview Answer

Embeddings can be expensive at scale.

Optimizations include caching, reducing dimensions, limiting retrieval size, and using approximate search.
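Two small helpers for the cost levers above, as a Python sketch: a raw-storage estimate (assuming float32, i.e. 4 bytes per dimension, and ignoring index overhead and metadata) and a hypothetical `CachedEmbedder` that pays for each distinct text once. The corpus size and dimensions in the example are illustrative.

```python
import hashlib
from typing import Callable


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage only; index overhead and metadata come on top."""
    return num_vectors * dims * bytes_per_value


class CachedEmbedder:
    """Wraps an embedding function so repeated texts cost one model call."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self.embed_fn = embed_fn
        self.cache: dict[str, list[float]] = {}
        self.model_calls = 0

    def embed(self, text: str) -> list[float]:
        # Collapse whitespace so trivially different strings share one entry.
        key = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.model_calls += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


# Illustrative corpus: 10M chunks at two example dimensions.
print(raw_vector_bytes(10_000_000, 1536) / 1e9)  # 61.44 (GB)
print(raw_vector_bytes(10_000_000, 768) / 1e9)   # 30.72 (GB)
```

Halving dimensions halves raw vector storage. The cache here is an unbounded dict; a real deployment would bound it, for example with LRU eviction.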


1️⃣3️⃣ Failure Modes


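Common Issues

  1. Poor chunking that splits or dilutes relevant content
  2. Noisy top-K results that are similar but not relevant
  3. Stale embeddings after the source data changes

Mitigation

  1. Improve chunking
  2. Add hybrid search and re-ranking
  3. Tune the index and keep embeddings fresh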

👉 Interview Answer

Most failures in embedding systems come from poor retrieval.

Improving chunking, indexing, and ranking is critical for system performance.
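Since most failures trace back to retrieval, it helps to measure retrieval on its own, before the LLM sees anything. A minimal recall@K sketch in Python over a small hand-labeled evaluation set; the eval-set format, the chunk ids, and the fake search function are all hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(eval_set, search_fn, k: int = 5) -> float:
    """eval_set: list of (query, relevant_chunk_ids); search_fn: query -> ranked ids."""
    return sum(recall_at_k(search_fn(q), rel, k) for q, rel in eval_set) / len(eval_set)


# Hypothetical hand-labeled queries and a fake search function.
eval_set = [
    ("refund policy", {"c1", "c3"}),
    ("reset password", {"c9"}),
]
fake_results = {
    "refund policy": ["c1", "c7", "c3"],
    "reset password": ["c4", "c9"],
}
print(mean_recall_at_k(eval_set, fake_results.get, k=2))  # 0.75
```

Re-running the same eval set after changing chunk size, index parameters, or re-ranking shows whether the change actually improved retrieval.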


1️⃣4️⃣ Trade-offs


Dimension                 Trade-off
Accuracy vs Latency       Exact vs ANN
Recall vs Precision       More docs vs better docs
Cost vs Quality           Larger embeddings vs cheaper models
Freshness vs Efficiency   Streaming vs batch

👉 Interview Answer

Designing embedding systems requires balancing accuracy, latency, cost, and freshness.


1️⃣5️⃣ End-to-End Flow


Documents
→ Chunking
→ Embedding
→ Store in vector DB

User query
→ Query embedding
→ Similarity search
→ Retrieve top-K
→ Re-rank
→ Return results

Key Insight

Embeddings enable meaning-based retrieval, which is the foundation of semantic search and RAG systems.


🧠 Staff-Level Answer (Final)


👉 Interview Answer Full Version

Embeddings convert text into vectors that capture semantic meaning, allowing systems to perform similarity search.

A vector database stores these embeddings along with metadata and supports efficient nearest-neighbor search.

In a typical system, documents are processed offline, split into chunks, converted into embeddings, and indexed in a vector database.

At query time, the user query is also embedded, and the system retrieves the most similar document chunks.

Because exact search is too slow at scale, vector databases use approximate nearest neighbor indexing, such as HNSW, to achieve fast retrieval with high recall.

Additional techniques like metadata filtering, hybrid search, and re-ranking improve retrieval precision.

Chunking strategy is critical, because embeddings operate at the chunk level.

Poor chunking can significantly reduce retrieval quality.

At scale, systems need to handle large datasets, high query throughput, and frequent updates, using sharding, caching, and efficient indexing.

The main trade-offs include accuracy, latency, cost, and freshness.

Ultimately, embeddings and vector databases provide the foundation for semantic search and RAG systems.


⭐ Final Insight

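The essence of embeddings is turning "meaning" into "distance in vector space"; the essence of a vector DB is "quickly finding nearest neighbors in high-dimensional space."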



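Chinese Version


🎯 Embeddings & Vector Database Design


1️⃣ Core Framework

When designing Embedding + Vector DB, you can approach it from:

  1. How embeddings work
  2. Vector generation
  3. Similarity search
  4. Index structures
  5. Query flow
  6. Scalability
  7. Trade-offs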

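2️⃣ What Is an Embedding?


An embedding is a vector representation of text:

text → vector

Core Idea

Semantically similar → vectors close together


Purpose

  1. Semantic search
  2. Retrieval for RAG


👉 Interview Answer

Embeddings convert text into semantic vectors, so the system can retrieve based on "meaning" rather than "keywords".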


3️⃣ What Is a Vector DB?


Stores:

vector + text + metadata

Supports:

  1. Fast nearest-neighbor (similarity) search
  2. Metadata filtering


4️⃣ Query Flow


query → embedding → vector search → top-K

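5️⃣ Indexing


Common:

  1. HNSW
  2. IVF
  3. Flat


6️⃣ Core Optimizations

  1. Chunking
  2. Hybrid search and re-ranking
  3. Caching


7️⃣ Core Problems

  1. Poor retrieval quality
  2. Latency and cost at scale
  3. Stale data


8️⃣ Core Trade-offs

  1. Accuracy vs latency
  2. Recall vs precision
  3. Cost vs quality
  4. Freshness vs efficiency


🧠 Interview Summary


Embedding + Vector DB is the foundation of RAG.

It enables semantic search through the vector space and efficient queries through ANN indexes.

The keys to the system are chunking, retrieval, and ranking, because retrieval determines what information the model gets to see.


⭐ One-Sentence Summary

Embedding = turning meaning into vectors. Vector DB = finding the most similar content in vector space.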
