
🎯 How the Google Search Indexing Pipeline Works

1️⃣ Core Indexing Framework (Staff-Level)

When discussing a Google-like search indexing pipeline, I frame it as:

URL discovery
Crawl scheduling
Fetching with politeness controls
Parsing and content extraction
Deduplication and canonicalization
Index construction
Freshness updates
Trade-offs: coverage vs freshness vs quality vs cost

2️⃣ Core Problem

The web is huge, noisy, duplicated, and constantly changing.

The indexing system must decide:

what to crawl
when to crawl it
how often to refresh it
what content to trust
how to turn pages into query-time indexes

👉 Interview Answer

A search indexing pipeline converts an uncontrolled web corpus into structured searchable indexes. The hard part is not just crawling pages. The hard part is prioritizing crawl budget, removing low-quality or duplicate content, and keeping important documents fresh without wasting resources.

3️⃣ High-Level Architecture

URL Discovery
        ↓
Crawl Frontier
        ↓
Fetcher
        ↓
Parser / Extractor
        ↓
Dedup / Canonicalization
        ↓
Document Store
        ↓
Indexer
        ↓
Inverted Index + Ranking Features
        ↓
Query Serving

4️⃣ Crawl Frontier

The crawl frontier decides which URL to fetch next.

Signals:

page importance
last crawl time
expected change frequency
domain crawl budget
robots.txt rules
historical quality
discovered links

👉 Interview Answer

The crawl frontier is a priority scheduler. It balances page importance, freshness needs, domain politeness, and crawl budget. Important and frequently changing pages are crawled more often than low-value static pages.
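To make the "priority scheduler" idea concrete, here is a minimal sketch in Python. The `UrlState` fields, the `priority` formula, and the cap of 10 are illustrative assumptions, not how any real search engine scores URLs. Per-domain budget and robots.txt checks are left out to keep the sketch short.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class UrlState:
    url: str
    importance: float      # assumed score in 0..1, e.g. derived from link signals
    change_rate: float     # assumed expected changes per day
    last_crawl_ts: float   # seconds since epoch


def priority(state: UrlState, now: float) -> float:
    """Higher means more urgent. The weighting is illustrative only."""
    days_since = (now - state.last_crawl_ts) / 86400.0
    # Expected number of changes missed since the last crawl, capped so that
    # one very stale page cannot dominate the queue.
    staleness = min(state.change_rate * days_since, 10.0)
    return state.importance * (1.0 + staleness)


class Frontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, state: UrlState, now: float) -> None:
        # heapq is a min-heap, so store the negated priority.
        entry = (-priority(state, now), next(self._counter), state)
        heapq.heappush(self._heap, entry)

    def pop(self) -> UrlState:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With this scoring, an important page that changes many times a day is popped before an equally important page that rarely changes, which matches the behavior described above.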

5️⃣ Fetching and Politeness

Fetcher responsibilities:

download page content
respect robots.txt
limit per-domain request rate
handle redirects
retry transient failures
avoid crawler traps
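Two of these responsibilities, respecting robots.txt and limiting the per-domain request rate, can be sketched with the Python standard library. The `PolitenessGate` class, its method names, and the fixed `min_delay_s` are hypothetical. The sketch assumes robots.txt has already been downloaded by some other component and conservatively refuses hosts whose rules are not yet loaded.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    def __init__(self, user_agent, min_delay_s=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay_s = min_delay_s
        self._clock = clock          # injectable so the gate can be tested
        self._robots = {}            # host -> RobotFileParser
        self._next_allowed = {}      # host -> earliest time of next request

    def set_robots(self, host, robots_txt):
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def may_fetch(self, url):
        host = urlsplit(url).netloc
        parser = self._robots.get(host)
        if parser is None:
            # Assumption: do not fetch until robots.txt rules are known.
            return False
        return parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before this host may be contacted again."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - self._clock())

    def record_fetch(self, url):
        host = urlsplit(url).netloc
        self._next_allowed[host] = self._clock() + self.min_delay_s
```

A fetcher loop would call `may_fetch`, sleep for `wait_time`, download the page, and then call `record_fetch`. Redirect handling, retries, and trap detection would sit around this gate and are not shown.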

6️⃣ Parsing and Extraction

The parser extracts:

title
body text
links
canonical URL
structured data
language
metadata
media references

It also removes:

boilerplate
navigation noise
spam patterns
malformed content

👉 Interview Answer

Parsing turns raw HTML into a normalized document representation. This step extracts text, links, metadata, language, and canonical signals so downstream indexing and ranking do not need to reason over messy raw pages.
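As a rough illustration of that normalization step, the sketch below uses Python's built-in `html.parser` to pull out the title, body text, links, canonical URL, and language. The `PageExtractor` class and the `SKIP_TAGS` set are assumptions made for this example. Real boilerplate removal is far more involved than skipping a few tags.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    # Assumed stand-in for boilerplate removal: ignore text inside these tags.
    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.canonical = None
        self.language = None
        self.links = []
        self._text = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.language = attrs.get("lang")
        elif tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])   # links are kept even inside nav
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self._text.append(data.strip())


def extract(html):
    """Return a normalized document dict for one HTML page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "body_text": " ".join(parser._text),
        "links": parser.links,
        "canonical": parser.canonical,
        "language": parser.language,
    }
```

Downstream stages can then work with the returned dict instead of raw HTML.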

7️⃣ Deduplication and Canonicalization

Duplicate examples:

same page with tracking parameters
print and mobile versions
mirrored content
copied content
HTTP and HTTPS variants

Techniques:

canonical tags
URL normalization
content fingerprints
shingling / similarity hashing
redirect consolidation
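Three of these techniques are easy to sketch in Python: URL normalization, an exact content fingerprint, and word shingling with Jaccard similarity for near-duplicates. The `TRACKING_PARAMS` list, the choice to fold HTTP into HTTPS, and the shingle size `k=3` are assumptions for illustration. A production system would more likely use a similarity hash such as SimHash or MinHash than compare raw shingle sets.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, incomplete list of tracking parameters to strip.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def normalize_url(url):
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    # Drop tracking parameters and sort the rest so order does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = parts.path or "/"
    # Assumption: HTTP and HTTPS variants serve the same document.
    return urlunsplit(("https", host, path, urlencode(query), ""))


def content_fingerprint(text):
    """Exact-duplicate fingerprint over case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets, in 0..1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages whose shingle sets have a Jaccard similarity above some threshold (the threshold itself is a tuning decision) would be treated as near-duplicates and folded into one canonical document.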

8️⃣ Index Construction

The inverted index maps:

term → list of documents containing the term

Additional data:

term positions
document quality features
freshness timestamp
language
entity signals
link signals
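The term-to-documents mapping with term positions can be sketched as a small in-memory structure. The `InvertedIndex` class, its method names, and the simple tokenizer are hypothetical. A real index is sharded, compressed, and stored on disk, none of which is shown here.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class InvertedIndex:
    def __init__(self):
        # term -> {doc_id: [positions of the term in that document]}
        self.postings = defaultdict(dict)
        # doc_id -> per-document features (quality, freshness, language, ...)
        self.doc_meta = {}

    def add(self, doc_id, text, **meta):
        for pos, term in enumerate(tokenize(text)):
            self.postings[term].setdefault(doc_id, []).append(pos)
        self.doc_meta[doc_id] = meta

    def search_all(self, query):
        """Documents containing every query term (boolean AND)."""
        terms = tokenize(query)
        if not terms:
            return set()
        doc_sets = [set(self.postings.get(t, {})) for t in terms]
        return set.intersection(*doc_sets)

    def search_phrase(self, query):
        """Documents containing the query terms at consecutive positions."""
        terms = tokenize(query)
        hits = set()
        for doc_id in self.search_all(query):
            starts = self.postings[terms[0]][doc_id]
            if any(
                all(p + i in self.postings[t][doc_id] for i, t in enumerate(terms))
                for p in starts
            ):
                hits.add(doc_id)
        return hits
```

Storing positions is what makes the phrase query possible: without them, the index could only answer whether a document contains the terms, not whether they appear next to each other.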

9️⃣ Freshness Strategy

Freshness differs by document type:

news: minutes
product pages: hours
blogs: days
static docs: weeks

👉 Interview Answer

Freshness should be selective. A production search engine does not refresh every page equally. It spends crawl and indexing resources where changes matter most to users.
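One way to picture selective freshness is a recrawl interval that starts from a per-type default and adapts to observed changes. The base intervals below are assumed values chosen only to match the orders of magnitude listed above (minutes, hours, days, weeks). The halve-on-change and back-off-by-1.5x rule is an illustrative heuristic, not a documented policy.

```python
# Assumed starting intervals in seconds, one per document type.
BASE_INTERVAL_S = {
    "news": 5 * 60,
    "product": 3 * 3600,
    "blog": 2 * 86400,
    "static": 14 * 86400,
}


def next_interval(doc_type, previous_interval_s, changed,
                  min_s=60, max_s=30 * 86400):
    """Return the next recrawl interval in seconds.

    If the page changed since the last crawl, crawl twice as often;
    otherwise back off by 1.5x. The result is clamped to [min_s, max_s].
    """
    base = BASE_INTERVAL_S.get(doc_type, BASE_INTERVAL_S["static"])
    current = previous_interval_s if previous_interval_s is not None else base
    proposed = current / 2 if changed else current * 1.5
    return max(min_s, min(max_s, proposed))
```

Pages that keep changing converge toward the minimum interval, while pages that never change drift toward the maximum, which frees crawl budget for the documents where freshness matters.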

🔟 Staff-Level Trade-offs

Decision	Benefit	Cost
Crawl more pages	Better coverage	Higher network and compute cost
Crawl important pages often	Better freshness	Less budget for long tail
Aggressive dedup	Cleaner index	Risk of dropping useful variants
Near-realtime indexing	Fresh results	More pipeline complexity
Heavy quality filtering	Better search quality	Risk of false negatives

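Chinese Section (translated)

Quick Notes

In One Sentence

The Google indexing pipeline turns a messy, duplicated, constantly changing web into a clean, queryable, rankable inverted index.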

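Key Points to Memorize

the crawl frontier decides what to crawl and when to crawl it
crawl budget should go first to important, frequently changing pages
the parser turns HTML into a normalized document
dedup/canonicalization keeps duplicate content from polluting the index
the core trade-off is coverage vs freshness vs cost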

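Interview Answer (translated from Chinese)

I would break the Google search indexing pipeline into URL discovery, crawl frontier, fetcher, parser, dedup, document store, and indexer. The crawl frontier is a priority scheduler: it picks the next batch of URLs to crawl based on page importance, change frequency, last crawl time, domain crawl budget, and robots rules.

After fetching, the parser extracts the title, body text, links, metadata, language, and canonical URL from the HTML. The system then applies URL normalization, content fingerprinting, and duplicate detection so that the same content does not enter the index under multiple URLs. Finally, the indexer builds the inverted index and stores freshness, quality, language, entity, and ranking features.

The staff-level point is that a search engine cannot crawl every page equally. Resources are limited, so it has to trade off coverage, freshness, quality, and cost, and spend its budget on the pages with the most user value.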

✅ Final Interview Answer

A Google-like indexing pipeline starts with URL discovery and a crawl frontier. The frontier prioritizes URLs using importance, freshness need, domain budget, and quality signals. Fetchers download pages while respecting politeness rules. Parsers extract text, links, metadata, language, and canonical information. Then the system deduplicates content, stores normalized documents, and builds inverted indexes plus ranking features for query serving.

At staff level, I would emphasize the trade-off between coverage and freshness. The system cannot crawl and re-index everything all the time. It must spend resources where user value is highest while filtering duplicates, spam, and low-quality content.

System Design Deep Dive - 07 How Google Search Indexing Pipeline Works
