🎯 How the Google Search Indexing Pipeline Works
1️⃣ Core Indexing Framework (Staff-Level)
When discussing a Google-like search indexing pipeline, I frame it as:
- URL discovery
- Crawl scheduling
- Fetching with politeness controls
- Parsing and content extraction
- Deduplication and canonicalization
- Index construction
- Freshness updates
- Trade-offs: coverage vs freshness vs quality vs cost
2️⃣ Core Problem
The web is huge, noisy, duplicated, and constantly changing.
The indexing system must decide:
- what to crawl
- when to crawl it
- how often to refresh it
- what content to trust
- how to turn pages into query-time indexes
👉 Interview Answer
A search indexing pipeline converts an uncontrolled web corpus into structured, searchable indexes. Crawling pages is the easy part. The hard part is prioritizing crawl budget, removing low-quality or duplicate content, and keeping important documents fresh without wasting resources.
3️⃣ High-Level Architecture
URL Discovery
↓
Crawl Frontier
↓
Fetcher
↓
Parser / Extractor
↓
Dedup / Canonicalization
↓
Document Store
↓
Indexer
↓
Inverted Index + Ranking Features
↓
Query Serving
4️⃣ Crawl Frontier
The crawl frontier decides what URL to fetch next.
Signals:
- page importance
- last crawl time
- expected change frequency
- domain crawl budget
- robots.txt rules
- historical quality
- discovered links
👉 Interview Answer
The crawl frontier is a priority scheduler. It balances page importance, freshness needs, domain politeness, and crawl budget. Important and frequently changing pages are crawled more often than low-value static pages.
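As a rough illustration of the "priority scheduler" idea, here is a minimal sketch built on a heap. The scoring formula (importance multiplied by the expected number of changes since the last crawl, capped at 10) and the `UrlState` fields are assumptions made for this example, not a description of how Google actually scores URLs. Domain politeness and robots.txt checks are deliberately left out.

```python
import heapq
import itertools
from dataclasses import dataclass

SECONDS_PER_DAY = 86400.0


@dataclass
class UrlState:
    url: str
    importance: float           # hypothetical score in [0, 1]
    change_rate_per_day: float  # estimated content changes per day
    last_crawl_ts: float        # epoch seconds of the last crawl


def priority(state: UrlState, now: float) -> float:
    """Higher means more urgent: importance times expected staleness."""
    days_since = (now - state.last_crawl_ts) / SECONDS_PER_DAY
    expected_changes = state.change_rate_per_day * days_since
    # Cap staleness so one very old page cannot dominate the queue.
    return state.importance * min(expected_changes, 10.0)


class CrawlFrontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, state: UrlState, now: float) -> None:
        # heapq is a min-heap, so the priority is negated.
        entry = (-priority(state, now), next(self._counter), state)
        heapq.heappush(self._heap, entry)

    def pop(self) -> UrlState:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With this scoring, an important page that changes several times a day is popped before an equally important page that rarely changes, which matches the behavior described above.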
5️⃣ Fetching and Politeness
Fetcher responsibilities:
- download page content
- respect robots.txt
- limit per-domain request rate
- handle redirects
- retry transient failures
- avoid crawler traps
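Two of these responsibilities, respecting robots.txt and limiting the per-domain request rate, can be sketched with the standard library's `urllib.robotparser`. The `PolitenessGate` class, its fixed `min_delay_s`, and the choice to allow hosts whose robots.txt has not been loaded yet are assumptions for illustration. A real fetcher would fetch robots.txt before the first request and would also handle redirects, retries, and trap detection.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: decides whether a URL may be fetched right now."""

    def __init__(self, user_agent: str, min_delay_s: float = 1.0):
        self.user_agent = user_agent
        self.min_delay_s = min_delay_s
        self._robots = {}        # host -> RobotFileParser
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def set_robots(self, host: str, robots_txt: str) -> None:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def may_fetch(self, url: str, now: float) -> bool:
        host = urlsplit(url).netloc
        parser = self._robots.get(host)
        # Assumption: hosts with no loaded robots.txt are treated as allowed.
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False
        # Enforce a minimum gap between requests to the same host.
        if now < self._next_allowed.get(host, 0.0):
            return False
        self._next_allowed[host] = now + self.min_delay_s
        return True
```

The caller passes `now` explicitly so the gate stays deterministic and easy to test. A URL that is refused for rate reasons would normally go back to the frontier rather than being dropped.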
6️⃣ Parsing and Extraction
The parser extracts:
- title
- body text
- links
- canonical URL
- structured data
- language
- metadata
- media references
It also removes:
- boilerplate
- navigation noise
- spam patterns
- malformed content
👉 Interview Answer
Parsing turns raw HTML into a normalized document representation. This step extracts text, links, metadata, language, and canonical signals so downstream indexing and ranking do not need to reason over messy raw pages.
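To make the "normalized document representation" concrete, here is a small sketch using Python's built-in `html.parser`. The `PageExtractor` class, the output dictionary shape, and the simple rule of skipping text inside `script`, `style`, and `nav` are assumptions for this example. Production boilerplate removal, language detection, and structured-data extraction are far more involved.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Hypothetical extractor: title, canonical URL, links, and body text."""

    SKIP_TAGS = {"script", "style", "nav"}  # crude stand-in for boilerplate removal

    def __init__(self):
        super().__init__()
        self.title = ""
        self.canonical = None
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # Links are kept even inside <nav>: they still help URL discovery.
            self.links.append(attrs["href"])
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.text_parts.append(text)


def extract(html: str) -> dict:
    parser = PageExtractor()
    parser.feed(html)
    return {
        "title": parser.title,
        "canonical": parser.canonical,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }
```

Downstream stages can then work on the returned dictionary instead of raw HTML, which is the point of this step.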
7️⃣ Deduplication and Canonicalization
Duplicate examples:
- same page with tracking parameters
- print and mobile versions
- mirrored content
- copied content
- HTTP and HTTPS variants
Techniques:
- canonical tags
- URL normalization
- content fingerprints
- shingling / similarity hashing
- redirect consolidation
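Two of these techniques, URL normalization and shingling, can be sketched in a few lines. The `TRACKING_PARAMS` list, the decision to fold HTTP into HTTPS, and the 3-word shingle size are assumptions for illustration. Folding schemes is only safe when the site actually serves the same content on both, and large-scale systems typically use compact similarity hashes rather than exact Jaccard over full shingle sets.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters; real lists are longer and site-specific.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Map trivially different URLs for the same page onto one key."""
    parts = urlsplit(url)
    # Assumption: HTTP and HTTPS variants serve identical content.
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc.lower()
    path = parts.path or "/"
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    )
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word windows over the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Similarity in [0, 1]; 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A pipeline might treat two documents as near-duplicates when their Jaccard similarity exceeds some threshold. The threshold is a tuning choice, and setting it too low is exactly the "risk of dropping useful variants" that aggressive dedup carries.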
8️⃣ Index Construction
The inverted index maps:
term → list of documents containing the term
Additional data:
- term positions
- document quality features
- freshness timestamp
- language
- entity signals
- link signals
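The term-to-documents mapping with term positions can be shown as a small in-memory sketch. The `InvertedIndex` class and its whitespace tokenizer are assumptions for this example. Real indexes are sharded, compressed, and store the additional features listed above alongside the postings.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy positional index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id: int, text: str) -> None:
        # Naive tokenizer: lowercase and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def docs_with_all(self, terms) -> set:
        """Documents containing every term (an AND query)."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def phrase_match(self, first: str, second: str) -> set:
        """Documents where `second` appears immediately after `first`."""
        a = self.postings.get(first.lower(), {})
        b = self.postings.get(second.lower(), {})
        matches = set()
        for doc_id in a.keys() & b.keys():
            second_positions = set(b[doc_id])
            if any(p + 1 in second_positions for p in a[doc_id]):
                matches.add(doc_id)
        return matches
```

Storing positions is what makes the phrase query possible: without them the index could only say that both terms occur somewhere in the document.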
9️⃣ Freshness Strategy
Freshness differs by document type:
- news: minutes
- product pages: hours
- blogs: days
- static docs: weeks
👉 Interview Answer
Freshness should be selective. A production search engine does not refresh every page equally. It spends crawl and indexing resources where changes matter most to users.
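One way to express selective freshness is an adaptive recrawl interval: start from a per-type baseline, shorten it when a crawl finds a change, and lengthen it when nothing changed. The baseline values below are illustrative picks within the orders of magnitude listed above, and the halve/double rule with a 4x clamp is an assumption for this sketch, not a known production policy.

```python
# Assumed baselines in seconds, matching the rough scales listed above.
BASE_INTERVAL_S = {
    "news": 10 * 60,         # minutes
    "product": 6 * 3600,     # hours
    "blog": 2 * 86400,       # days
    "static": 14 * 86400,    # weeks
}


def next_interval(doc_type: str, current_interval_s, changed: bool) -> float:
    """Return the next recrawl interval in seconds.

    Pass current_interval_s=None for a document that has no history yet.
    """
    base = BASE_INTERVAL_S[doc_type]
    if current_interval_s is None:
        return float(base)
    # Crawl sooner if the last crawl found a change, later if it did not.
    proposed = current_interval_s / 2 if changed else current_interval_s * 2
    # Clamp so the interval stays within 4x of the baseline in either direction.
    return max(base / 4, min(proposed, base * 4))
```

The clamp keeps a briefly quiet news page from drifting to a weekly schedule, and keeps a static page from consuming news-level crawl budget after one edit.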
🔟 Staff-Level Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| Crawl more pages | Better coverage | Higher network and compute cost |
| Crawl important pages often | Better freshness | Less budget for long tail |
| Aggressive dedup | Cleaner index | Risk of dropping useful variants |
| Near-realtime indexing | Fresh results | More pipeline complexity |
| Heavy quality filtering | Better search quality | Risk of false negatives |
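Recap Section
Quick Notes
In One Sentence
The Google indexing pipeline turns messy, duplicated, constantly changing web pages into a clean, queryable, rankable inverted index.
Key Points to Memorize
- The crawl frontier decides what to crawl and when to crawl it
- Crawl budget should go first to pages that are both important and frequently changing
- The parser turns HTML into a normalized document
- Dedup/canonicalization keeps duplicate content from polluting the index
- The core trade-off is coverage vs freshness vs cost
Condensed Interview Answer
I would break the Google search indexing pipeline into URL discovery, crawl frontier, fetcher, parser, dedup, document store, and indexer. The crawl frontier is a priority scheduler: it picks the next batch of URLs to crawl based on page importance, change frequency, last crawl time, domain crawl budget, and robots rules.
After fetching, the parser extracts the title, body text, links, metadata, language, and canonical URL from the HTML. The system then applies URL normalization, content fingerprinting, and duplicate detection so the same content does not enter the index under multiple URLs. Finally, the indexer builds the inverted index and stores freshness, quality, language, entity, and ranking features.
The staff-level point is that a search engine cannot crawl all pages equally. Resources are limited, so it must trade off coverage, freshness, quality, and cost, and spend its budget on the pages with the most user value.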
✅ Final Interview Answer
A Google-like indexing pipeline starts with URL discovery and a crawl frontier. The frontier prioritizes URLs using importance, freshness need, domain budget, and quality signals. Fetchers download pages while respecting politeness rules. Parsers extract text, links, metadata, language, and canonical information. Then the system deduplicates content, stores normalized documents, and builds inverted indexes plus ranking features for query serving.
At staff level, I would emphasize the trade-off between coverage and freshness. The system cannot crawl and re-index everything all the time. It must spend resources where user value is highest while filtering duplicates, spam, and low-quality content.