
System Design Deep Dive - 07 How Google Search Indexing Pipeline Works

Post by ailswan May. 26, 2026


🎯 How Google Search Indexing Pipeline Works


1️⃣ Core Indexing Framework (Staff-Level)

When discussing a Google-like search indexing pipeline, I frame it as:

  1. URL discovery
  2. Crawl scheduling
  3. Fetching with politeness controls
  4. Parsing and content extraction
  5. Deduplication and canonicalization
  6. Index construction
  7. Freshness updates
  8. Trade-offs: coverage vs freshness vs quality vs cost

2️⃣ Core Problem

The web is huge, noisy, duplicated, and constantly changing.

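The indexing system must decide which URLs deserve crawl budget, which pages to drop as duplicates or low quality, and how often to refresh each important document.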


👉 Interview Answer

A search indexing pipeline converts an uncontrolled web corpus into structured searchable indexes. The hard part is not just crawling pages. The hard part is prioritizing crawl budget, removing low-quality or duplicate content, and keeping important documents fresh without wasting resources.


3️⃣ High-Level Architecture

URL Discovery
        ↓
Crawl Frontier
        ↓
Fetcher
        ↓
Parser / Extractor
        ↓
Dedup / Canonicalization
        ↓
Document Store
        ↓
Indexer
        ↓
Inverted Index + Ranking Features
        ↓
Query Serving

4️⃣ Crawl Frontier

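The crawl frontier decides which URL to fetch next.

Signals: page importance, change frequency, time since the last crawl, domain crawl budget, and robots rules.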


👉 Interview Answer

The crawl frontier is a priority scheduler. It balances page importance, freshness needs, domain politeness, and crawl budget. Important and frequently changing pages are crawled more often than low-value static pages.
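
To make the "priority scheduler" idea concrete, here is a minimal sketch of a frontier built on a heap. The `UrlRecord` fields, the `priority` formula, and its 0.6 / 0.4 weights are my own illustrative assumptions, not how any production crawler scores URLs. Domain politeness and crawl budget are deliberately left out to keep the example short.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class UrlRecord:
    url: str
    importance: float           # hypothetical score in [0, 1]
    change_rate: float          # hypothetical: expected changes per day
    last_crawl_age_days: float  # days since the last fetch


def priority(rec: UrlRecord) -> float:
    """Higher means crawl sooner. The weights are illustrative, not tuned."""
    # Expected number of missed changes, capped at 1.0 so it stays comparable.
    staleness = min(rec.change_rate * rec.last_crawl_age_days, 1.0)
    return 0.6 * rec.importance + 0.4 * staleness


class CrawlFrontier:
    """Max-priority queue of URLs, built on heapq (which is a min-heap)."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, rec: UrlRecord) -> None:
        # Negate the priority so the highest-priority URL pops first.
        heapq.heappush(self._heap, (-priority(rec), next(self._counter), rec))

    def pop(self) -> UrlRecord:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = CrawlFrontier()
frontier.add(UrlRecord("https://example.com/archive/2009", 0.2, 0.01, 30))
frontier.add(UrlRecord("https://example.com/news", 0.9, 5.0, 1))
frontier.add(UrlRecord("https://example.com/docs", 0.5, 0.1, 2))
crawl_order = [frontier.pop().url for _ in range(len(frontier))]
```

With these made-up numbers, the important, fast-changing news page is popped before the docs page, and the static archive page comes last.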


5️⃣ Fetching and Politeness

Fetcher responsibilities: download pages while respecting robots rules and per-domain politeness limits.


6️⃣ Parsing and Extraction

The parser extracts: title, body text, links, metadata, language, and canonical URL.

It also removes:


👉 Interview Answer

Parsing turns raw HTML into a normalized document representation. This step extracts text, links, metadata, language, and canonical signals so downstream indexing and ranking do not need to reason over messy raw pages.
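
As a rough illustration of that normalized representation, the sketch below uses the standard-library `html.parser` to pull out the title, language, canonical URL, outgoing links, and visible text. The `PageExtractor` class and the shape of the returned dict are hypothetical, and skipping `<script>` and `<style>` content is my own assumption about what counts as noise.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects title, language, canonical URL, links, and visible text."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.language = None
        self.canonical = None
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.language = attrs.get("lang")
        elif tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            if attrs.get("href"):
                self.canonical = urljoin(self.base_url, attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str, base_url: str) -> dict:
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "language": parser.language,
        "canonical": parser.canonical,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }


sample = (
    '<html lang="en"><head><title>Hello</title>'
    '<link rel="canonical" href="/page"><script>var x=1;</script></head>'
    '<body><p>Some text</p><a href="/a?x=1">A</a></body></html>'
)
doc = extract(sample, "https://example.com/page?utm=1")
```

Downstream stages can then work with `doc` instead of raw HTML, which is the point of the normalization step.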


7️⃣ Deduplication and Canonicalization

Duplicate examples:

Techniques: URL normalization, content fingerprinting, and duplicate detection.
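
A minimal sketch of two common dedup building blocks follows: URL normalization plus an exact content fingerprint, and word-shingle Jaccard similarity for near duplicates. The tracking-parameter list, the shingle size, and all function names are assumptions for illustration. Large-scale systems typically use more compact fingerprints than full shingle sets.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative only: real systems maintain much larger, per-site rule sets.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop fragment and tracking params, sort query."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/",
         urlencode(query), "")
    )


def content_fingerprint(text: str) -> str:
    """Exact-duplicate fingerprint over case- and whitespace-normalized text."""
    canonical_text = " ".join(text.lower().split())
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word windows; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Similarity in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


same_url = normalize_url("HTTP://Example.com/a?utm_source=x&b=2&a=1#frag")
near_dup_score = jaccard(
    shingles("the quick brown fox jumps over the lazy dog"),
    shingles("the quick brown fox jumps over the lazy cat"),
)
```

Here the two sentences share 6 of 8 distinct shingles, so `near_dup_score` is 0.75. Where to set the near-duplicate threshold is a tuning decision tied to the "risk of dropping useful variants" trade-off.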


8️⃣ Index Construction

The inverted index maps:

term → list of documents containing the term
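
The mapping above can be sketched in a few lines. This toy version keeps sorted document-ID lists in memory and answers AND queries by intersecting them. The tokenizer and function names are my own assumptions, and real indexes add compression, positions, and sharding that are out of scope here.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict) -> dict:
    """Map each term to a sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index: dict, query: str) -> list:
    """Return IDs of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: "Crawl frontier scheduling",
    2: "Inverted index construction",
    3: "Crawl budget and index freshness",
}
index = build_inverted_index(docs)
hits = search_and(index, "crawl index")
```

Only document 3 contains both "crawl" and "index", so `hits` is `[3]`.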

Additional data: freshness, quality, language, entity, and ranking features.


9️⃣ Freshness Strategy

Freshness differs by document type:


👉 Interview Answer

Freshness should be selective. A production search engine does not refresh every page equally. It spends crawl and indexing resources where changes matter most to users.
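
One simple way to spend refresh budget selectively is an adaptive recrawl interval: shorten it when a page changed since the last visit and lengthen it when it did not. The sketch below is a heuristic I am assuming for illustration. The halve/double factors, the clamp bounds, and the `RecrawlState` type are not taken from any real search engine.

```python
from dataclasses import dataclass

MIN_INTERVAL_H = 1.0          # assumed floor: at most one crawl per hour
MAX_INTERVAL_H = 24.0 * 30    # assumed ceiling: at least one crawl per 30 days


@dataclass
class RecrawlState:
    interval_hours: float
    last_fingerprint: str


def update_interval(state: RecrawlState, new_fingerprint: str) -> RecrawlState:
    """Halve the interval if content changed, double it if it did not."""
    changed = new_fingerprint != state.last_fingerprint
    factor = 0.5 if changed else 2.0
    interval = state.interval_hours * factor
    # Clamp so hot pages are not hammered and cold pages are not forgotten.
    interval = min(max(interval, MIN_INTERVAL_H), MAX_INTERVAL_H)
    return RecrawlState(interval, new_fingerprint)


state = RecrawlState(interval_hours=24.0, last_fingerprint="a")
after_change = update_interval(state, "b")            # content changed
after_no_change = update_interval(after_change, "b")  # content unchanged
```

Starting from a 24-hour interval, one observed change drops it to 12 hours, and a following unchanged visit brings it back to 24 hours. The interval could then feed the staleness signal used by the crawl frontier.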


🔟 Staff-Level Trade-offs

Decision                    | Benefit               | Cost
----------------------------|-----------------------|--------------------------------
Crawl more pages            | Better coverage       | Higher network and compute cost
Crawl important pages often | Better freshness      | Less budget for long tail
Aggressive dedup            | Cleaner index         | Risk of dropping useful variants
Near-realtime indexing      | Fresh results         | More pipeline complexity
Heavy quality filtering     | Better search quality | Risk of false negatives

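Recap Section

Quick Notes

In One Sentence

The Google indexing pipeline turns messy, duplicated, constantly changing web pages into a clean, queryable, rankable inverted index.


Key Points to Memorize


Recap Interview Answer

I would break the Google search indexing pipeline into URL discovery, crawl frontier, fetcher, parser, dedup, document store, and indexer. The crawl frontier is a priority scheduler: it picks the next batch of URLs to fetch based on page importance, change frequency, last crawl time, domain crawl budget, and robots rules.

After fetching, the parser extracts the title, body text, links, metadata, language, and canonical URL from the HTML. The system then applies URL normalization, content fingerprinting, and duplicate detection so that the same content does not enter the index under multiple URLs. Finally, the indexer builds the inverted index and stores freshness, quality, language, entity, and ranking features.

The staff-level point is that a search engine cannot crawl all pages equally. Resources are limited, so it must trade off coverage, freshness, quality, and cost, and spend its budget on the pages with the most user value.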


✅ Final Interview Answer

A Google-like indexing pipeline starts with URL discovery and a crawl frontier. The frontier prioritizes URLs using importance, freshness need, domain budget, and quality signals. Fetchers download pages while respecting politeness rules. Parsers extract text, links, metadata, language, and canonical information. Then the system deduplicates content, stores normalized documents, and builds inverted indexes plus ranking features for query serving.

At staff level, I would emphasize the trade-off between coverage and freshness. The system cannot crawl and re-index everything all the time. It must spend resources where user value is highest while filtering duplicates, spam, and low-quality content.
