System Design Deep Dive - 26 Design Web Crawler

Post by ailswan, May 19, 2026


🎯 Design Web Crawler

1️⃣ Core Framework

When discussing Web Crawler design, I frame it as:

  1. URL discovery
  2. URL scheduling
  3. Fetching web pages
  4. Parsing and extracting links
  5. Deduplication
  6. Storage and indexing
  7. Crawl politeness and rate limiting
  8. Distributed scaling and fault tolerance

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

A web crawler continuously discovers, fetches, parses, and stores web content.

The core challenge is scaling URL discovery and fetching while avoiding duplicate work, respecting politeness rules, and maintaining high throughput.


3️⃣ Main APIs


Submit Seed URLs

POST /api/crawl/seeds

Request:

{
  "urls": [
    "https://example.com",
    "https://news.site.com"
  ]
}

Query Crawl Status

GET /api/crawl/status?url=https://example.com

Search Indexed Content

GET /api/search?q=distributed+systems

👉 Interview Answer

I would expose APIs for seed submission, crawl monitoring, and indexed content querying.

Internally, the crawler operates as a distributed asynchronous pipeline.


4️⃣ High-Level Architecture


Seed URLs
→ URL Frontier
→ Scheduler
→ Crawl Workers
→ HTML Parser
→ URL Extractor
→ Deduplication
→ Storage / Index

Metadata
→ Crawl Database

robots.txt Cache
→ Politeness Controller

Main Components

URL Frontier


Scheduler


Crawl Workers


Parser / Extractor


👉 Interview Answer

I would separate URL scheduling, page fetching, parsing, deduplication, and storage into independent components.

This allows the crawler to scale horizontally and isolate failures.


5️⃣ URL Discovery


Sources of URLs


URL Extraction Flow

Fetch HTML
→ Parse DOM
→ Extract hyperlinks
→ Normalize URLs
→ Deduplicate
→ Push into frontier
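
The "Parse DOM → Extract hyperlinks" steps above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not a production parser: the names LinkExtractor and extract_links are hypothetical, only <a href> links are collected, and only http(s) targets are kept.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href=...> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL, then drop the fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        # Skip non-web schemes such as mailto: or javascript:.
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)


def extract_links(base_url: str, html: str) -> list:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The extracted links would still need normalization and deduplication before being pushed into the frontier.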

URL Normalization

Examples:

HTTP → HTTPS normalization (only when both schemes serve the same content)
Remove fragments (#)
Canonicalize query params
Lowercase hostname
Remove duplicate slashes
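
The rules above could be combined into a single function along these lines. It is a sketch under assumptions: normalize_url and its upgrade_to_https flag are hypothetical names, query canonicalization is taken to mean "sort the parameters", and the HTTP → HTTPS upgrade is opt-in because the two schemes are not guaranteed to serve the same content.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url: str, upgrade_to_https: bool = False) -> str:
    """Return a canonical form of `url` suitable for deduplication."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if upgrade_to_https and scheme == "http":
        scheme = "https"
    # Lowercase the hostname (any userinfo in the URL is dropped here).
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    # Collapse duplicate slashes; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Sort query parameters so that ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))
```

For example, normalize_url("HTTP://Example.COM//a//b?z=1&a=2#frag") yields "http://example.com/a/b?a=2&z=1". Sorting parameters is a simplification: some sites are order-sensitive, so a real crawler may need per-site rules.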

👉 Interview Answer

The crawler discovers URLs recursively from fetched pages.

Extracted URLs should be normalized and deduplicated before entering the crawl frontier, otherwise duplicates can explode rapidly.


6️⃣ URL Frontier and Scheduling


Why URL Frontier Matters

The frontier controls:


Common Scheduling Strategies

FIFO

Simple breadth-first crawl.


Priority-based

Higher priority for:


Host-based Scheduling

Avoid overloading a single domain.


Example

example.com → at most one request every 5 seconds
news.com → at most one request every 500 ms
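
One way to picture host-based scheduling is a frontier that keeps a FIFO queue per host plus a heap of "next allowed fetch time" per host. The sketch below is illustrative only: HostFrontier and its methods are hypothetical names, everything lives in memory, and the caller passes the current time explicitly so the behavior is easy to test.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostFrontier:
    """Per-host FIFO queues plus a heap of (next_allowed_time, host)."""

    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay
        self.delays = {}                  # host -> minimum seconds between fetches
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # heap of (next_allowed_time, host)
        self.scheduled = set()            # hosts currently present in the heap
        self.next_allowed = {}            # host -> earliest time of the next fetch

    def set_delay(self, host: str, seconds: float) -> None:
        self.delays[host] = seconds

    def push(self, url: str, now: float) -> None:
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        if host not in self.scheduled:
            # Never schedule a host earlier than its politeness window allows.
            when = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready, (when, host))
            self.scheduled.add(host)

    def pop(self, now: float):
        """Return a URL whose host may be fetched at `now`, or None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delays.get(host, self.default_delay)
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        else:
            self.scheduled.discard(host)
        return url
```

With a 5-second delay for example.com and 0.5 seconds for news.com, a worker polling pop() receives at most one URL per host per window, regardless of how many URLs are queued for that host.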

👉 Interview Answer

The URL frontier is the heart of the crawler.

It determines crawl order, enforces fairness, and balances freshness versus throughput.

Host-based scheduling is important to avoid overwhelming websites.


7️⃣ Fetching Pages


Crawl Worker Flow

Get URL from frontier
→ Check robots.txt
→ DNS lookup
→ HTTP fetch
→ Handle redirects
→ Store response
→ Parse content

HTTP Concerns


Retry Strategy

Exponential backoff
Retry only transient failures
Limit retry count
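
A retry wrapper following these three rules might look like the sketch below. TransientError is a hypothetical marker exception standing in for timeouts, 5xx, or 429 responses; fetch is whatever callable performs the HTTP request; the jitter and default limits are assumptions, not values from the original design.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, 429)."""


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5,
                     max_delay=30.0, sleep=time.sleep):
    """Call fetch(url), retrying only TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except TransientError:
            # Give up once the retry budget is exhausted.
            if attempt == max_attempts - 1:
                raise
            # Exponential cap: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads retries out so workers do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Any other exception (for example a permanent 404 mapped to a different error type) propagates immediately without a retry.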

👉 Interview Answer

Crawl workers should be lightweight and stateless.

They fetch pages asynchronously, handle retries and redirects, and pass successful responses to downstream parsers.


8️⃣ robots.txt and Politeness


Why Politeness Matters

Without controls, a crawler can overload websites.


robots.txt Example

User-agent: *
Disallow: /private/
Crawl-delay: 5

Politeness Rules


robots.txt Cache

domain → robots rules + expiration
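
The "domain → robots rules + expiration" mapping can be sketched on top of Python's standard urllib.robotparser module. RobotsCache and the fetch_robots callable are hypothetical; fetch_robots is assumed to return the robots.txt body for a host as text, and the one-hour TTL is an arbitrary default.

```python
import time
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches parsed robots.txt rules per host with a time-to-live."""

    def __init__(self, fetch_robots, ttl_seconds=3600.0, clock=time.monotonic):
        self.fetch_robots = fetch_robots  # hypothetical callable: host -> robots.txt text
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}                 # host -> (expires_at, RobotFileParser)

    def rules_for(self, host: str) -> RobotFileParser:
        entry = self.entries.get(host)
        if entry is None or entry[0] <= self.clock():
            # Missing or expired: fetch and parse again.
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(host).splitlines())
            entry = (self.clock() + self.ttl, parser)
            self.entries[host] = entry
        return entry[1]

    def allowed(self, user_agent: str, host: str, url: str) -> bool:
        return self.rules_for(host).can_fetch(user_agent, url)

    def crawl_delay(self, user_agent: str, host: str):
        # None when the file has no Crawl-delay for this agent.
        return self.rules_for(host).crawl_delay(user_agent)
```

Note that Crawl-delay is a widely used extension rather than part of the core robots exclusion standard, so not every crawler honors it.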

👉 Interview Answer

A production crawler must respect robots.txt and implement politeness controls.

I would enforce per-domain rate limits and cache robots rules to reduce overhead.


9️⃣ Deduplication


Why Duplicates Happen


Dedup Layers

URL-level Dedup

Prevent recrawling the same normalized URL.


Content-level Dedup

Detect identical pages.

Example:

hash(page_content)
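
As a rough illustration of hash(page_content), the sketch below fingerprints a page with SHA-256 after collapsing whitespace. The names content_fingerprint and ContentDeduper are hypothetical, and an exact hash like this only catches identical pages, not near-duplicates.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """SHA-256 of the page with runs of whitespace collapsed."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ContentDeduper:
    """Remembers the first URL seen for each content fingerprint."""

    def __init__(self):
        self.seen = {}  # fingerprint -> first URL with that content

    def is_duplicate(self, url: str, html: str) -> bool:
        fingerprint = content_fingerprint(html)
        if fingerprint in self.seen:
            return True
        self.seen[fingerprint] = url
        return False
```

The fingerprint is the kind of value that could be stored in a content_hash column for later comparison on recrawl.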

Tools


👉 Interview Answer

Deduplication is critical at internet scale.

I would use normalized URLs, Bloom filters, and content hashing to reduce duplicate crawling and storage.
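
To make the Bloom filter part concrete, here is a small in-memory sketch. The BloomFilter class is hypothetical; it uses the standard sizing formulas and derives its k bit positions from one SHA-256 digest via double hashing. A Bloom filter never reports a seen URL as unseen, but it can report an unseen URL as seen, so a small fraction of new URLs would be skipped.

```python
import hashlib
import math


class BloomFilter:
    """Space-efficient set membership with a tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        self.size = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        # Double hashing: position_i = h1 + i * h2 (mod size).
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A typical use is to check `normalized_url in seen_filter` before enqueueing, and call `seen_filter.add(normalized_url)` once the URL is accepted.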


🔟 Parsing and Extraction


Extracted Data


Parsing Challenges


JavaScript-heavy Pages

Possible solutions:

Headless browser
Render service
Selective JS execution

👉 Interview Answer

Parsing converts fetched pages into structured data.

The crawler should extract links and metadata efficiently, while avoiding expensive full browser rendering whenever possible.


1️⃣1️⃣ Storage Design


What to Store


Example Crawl Metadata

crawl_metadata (
  url VARCHAR,
  status_code INT,
  crawled_at TIMESTAMP,
  content_hash VARCHAR,
  response_time_ms INT
)
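
For experimentation, the schema above can be tried out with SQLite from Python. This is only a sketch: making url the primary key, storing the timestamp as ISO-8601 text, and the open_store / record_crawl helpers are all assumptions for illustration, not part of the original design.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the crawl_metadata table (url is assumed to be the key)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS crawl_metadata (
               url TEXT PRIMARY KEY,
               status_code INTEGER,
               crawled_at TEXT,
               content_hash TEXT,
               response_time_ms INTEGER
           )"""
    )
    return conn


def record_crawl(conn, url, status_code, crawled_at, content_hash, response_time_ms):
    """Insert the latest crawl result, replacing any earlier row for the URL."""
    conn.execute(
        "INSERT OR REPLACE INTO crawl_metadata VALUES (?, ?, ?, ?, ?)",
        (url, status_code, crawled_at, content_hash, response_time_ms),
    )
    conn.commit()
```

Keeping one row per URL makes it cheap to compare the new content_hash with the previous one on recrawl; a design that needs crawl history would use (url, crawled_at) as the key instead.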

Storage Types

Object Storage

For raw HTML.


Relational / NoSQL DB

For metadata.


Search Index

For query serving.


Graph Storage

For link graph analysis.


👉 Interview Answer

I would separate raw page storage, metadata storage, and search indexing.

Raw HTML is useful for replay and reprocessing, while metadata powers monitoring and search.


1️⃣2️⃣ Distributed Scaling


Why Distribution is Needed

The web is enormous.

Large crawlers process:

billions of URLs

Scaling Strategies

Partition by Host

worker group A → example.com
worker group B → news.com
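
A simple way to realize host partitioning is to hash the hostname to a shard number, so every URL of a host lands on the same worker group. The shard_for function below is a hypothetical sketch that uses plain modulo hashing; a system that resizes often would more likely use consistent hashing, which is not shown here.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard based only on its lowercase hostname."""
    host = (urlsplit(url).hostname or "").lower()
    # Use a stable hash: Python's built-in hash() varies between processes.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the shard depends only on the host, per-host rate limits, robots.txt rules, and DNS results can stay local to one worker group.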

Distributed Frontier

Sharded URL queues.


Stateless Workers

Easy horizontal scaling.


Async Fetching

Support many concurrent requests.


Queue Technologies

Kafka
Pulsar
SQS
Redis Streams

👉 Interview Answer

A large-scale crawler should use distributed URL queues, stateless workers, and asynchronous fetching.

Partitioning by host helps enforce politeness and improves cache locality.


1️⃣3️⃣ Freshness and Recrawling


Why Recrawl?

Pages change over time.


Recrawl Strategy

Factors:


Example

News homepage → every minute
Blog archive → once a week

Adaptive Crawling

Adjust frequency dynamically.
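
One simple heuristic for adjusting frequency is to halve the recrawl interval when a page changed since the last fetch and double it when it did not, clamped to a range. The function name, the factors, and the one-minute and one-week bounds are assumptions chosen to match the examples above, not a prescribed policy.

```python
def next_interval(current_seconds: float, changed: bool,
                  min_interval: float = 60.0,
                  max_interval: float = 7 * 24 * 3600.0) -> float:
    """Shrink the recrawl interval for changing pages, grow it for static ones."""
    factor = 0.5 if changed else 2.0
    return max(min_interval, min(max_interval, current_seconds * factor))
```

"Changed" can be decided by comparing the new content hash with the stored one from the previous crawl.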


👉 Interview Answer

Web crawling is continuous, not one-time.

Frequently changing pages should be crawled more often, while static pages can be refreshed less frequently.


1️⃣4️⃣ Handling Bad Content


Common Problems


Protection Strategies


Spider Trap Example

calendar?page=1
calendar?page=2
calendar?page=3
...
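
A guard against this kind of unbounded URL generation could combine a depth limit, a URL length limit, and a per-host quota. TrapGuard and its default limits are hypothetical; real crawlers typically add pattern-based heuristics on top.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    """Rejects URLs that exceed depth, length, or per-host quota limits."""

    def __init__(self, max_depth: int = 10, max_urls_per_host: int = 10000,
                 max_url_length: int = 2000):
        self.max_depth = max_depth
        self.max_urls_per_host = max_urls_per_host
        self.max_url_length = max_url_length
        self.per_host = Counter()  # host -> number of URLs admitted so far

    def admit(self, url: str, depth: int) -> bool:
        # Very deep or very long URLs are a common symptom of a trap.
        if depth > self.max_depth or len(url) > self.max_url_length:
            return False
        host = (urlsplit(url).hostname or "").lower()
        if self.per_host[host] >= self.max_urls_per_host:
            return False
        self.per_host[host] += 1
        return True
```

With a quota of 3, the calendar example above would have pages 1 to 3 admitted and every later page rejected, while other hosts remain unaffected.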

👉 Interview Answer

Crawlers must defend against spider traps and unbounded URL generation.

I would apply URL heuristics, depth limits, and quota controls.


1️⃣5️⃣ Search Indexing


Search Pipeline

Fetched page
→ Parse text
→ Tokenize
→ Build inverted index
→ Store ranking metadata
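
The tokenize and inverted-index steps can be illustrated with a toy in-memory index. InvertedIndex and tokenize are hypothetical names; the tokenizer only lowercases and splits on non-alphanumeric characters, and search uses AND semantics with no ranking.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document IDs that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc_id, ...}

    def add(self, doc_id, text: str) -> None:
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set:
        """Return IDs of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A production index would also store term positions and frequencies so that ranking signals can be applied at query time.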

Indexed Fields


Ranking Signals


👉 Interview Answer

Crawled pages are typically transformed into a search index.

An inverted index enables fast keyword search, while ranking signals improve result quality.


1️⃣6️⃣ Failure Handling


Common Failures


Strategies


👉 Interview Answer

The crawler should assume failures are normal.

Persistent queues, retries, and checkpointing allow the system to recover gracefully.


1️⃣7️⃣ Consistency Model


Eventual Consistency is Acceptable

Because:


Stronger Consistency Needed For


👉 Interview Answer

Web crawlers are naturally eventually consistent.

Different workers may see different versions of the web, and freshness varies over time.

Strong consistency is mainly needed for scheduling and dedup coordination.


1️⃣8️⃣ Observability


Key Metrics


Alerts


👉 Interview Answer

I would monitor crawl throughput, queue backlog, fetch latency, duplicate rate, crawl freshness, retry rate, and worker failures.

These metrics show whether the crawler is healthy, efficient, and polite.


1️⃣9️⃣ End-to-End Flow


Crawl Flow

Seed URL submitted
→ URL frontier stores URL
→ Scheduler assigns crawl task
→ Worker fetches page
→ Parser extracts content and links
→ Deduplication filters duplicates
→ New URLs pushed into frontier
→ Content stored and indexed
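
Stripped of distribution, politeness, and persistence, the crawl flow above reduces to a short loop. The sketch below is a single-process illustration with hypothetical fetch and extract_links callables supplied by the caller; it is meant to show the control flow, not a deployable crawler.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages: int = 100) -> dict:
    """Breadth-first crawl; returns {url: page} for every page fetched."""
    frontier = deque(seeds)  # URL frontier (plain FIFO here)
    seen = set(seeds)        # URL-level deduplication
    store = {}               # stand-in for content storage and indexing
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        page = fetch(url)    # returns None when the fetch fails
        if page is None:
            continue
        store[url] = page
        for link in extract_links(url, page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store
```

In the distributed design, the deque becomes a sharded frontier, the seen set becomes a Bloom filter or key-value store, and fetch runs on stateless workers.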

Recrawl Flow

Page marked stale
→ Scheduler re-enqueues URL
→ Worker refetches content
→ Index updated

Key Insight

A web crawler is fundamentally a distributed URL scheduling and fetching system.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a web crawler, I think of it as a distributed URL discovery, scheduling, fetching, parsing, and indexing pipeline.

Seed URLs enter a distributed URL frontier, where the scheduler prioritizes crawl order and enforces politeness constraints.

Crawl workers fetch pages asynchronously, respect robots.txt, handle retries and redirects, and pass responses to parsers.

Parsers extract links and metadata, normalize URLs, and push newly discovered URLs back into the frontier.

Deduplication is critical at scale, so I would use normalized URLs, Bloom filters, and content hashing to avoid duplicate crawling and storage.

Raw HTML, metadata, and the link graph should be stored separately, because they serve different downstream needs.

The crawler must continuously recrawl important pages, since the web changes constantly.

Distributed queues, stateless workers, asynchronous fetching, and host-based partitioning allow the system to scale horizontally.

Politeness is extremely important. The crawler must respect robots.txt, enforce per-domain rate limits, and avoid spider traps or abusive crawling behavior.

The main trade-offs are crawl freshness, coverage, throughput, storage cost, and operational complexity.

Ultimately, the goal is to discover, fetch, and index web content efficiently and responsibly at internet scale.


⭐ Final Insight

The core of a web crawler is not "downloading web pages". It is a large-scale distributed system made up of a URL frontier, a scheduler, distributed fetching, deduplication, parsing, and indexing.



Chinese Version


🎯 Design Web Crawler


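1️⃣ Core Framework

When designing a web crawler, I usually analyze it through the following core modules:

  1. URL discovery
  2. URL scheduling
  3. Page fetching
  4. Parsing and link extraction
  5. Deduplication
  6. Storage and indexing
  7. Politeness and rate limiting
  8. Distributed scaling and fault tolerance

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

A web crawler continuously discovers, fetches, parses, and stores web content.

The core challenge is discovering URLs efficiently at scale, avoiding duplicate fetches, and keeping the crawl polite.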


3️⃣ Main APIs


Submit Seed URLs

POST /api/crawl/seeds

Query Crawl Status

GET /api/crawl/status?url=https://example.com

Search Indexed Content

GET /api/search?q=distributed+systems

👉 Interview Answer

I would expose APIs for seed submission, crawl monitoring, and indexed content querying.

Internally, the crawler is essentially a distributed asynchronous pipeline.


4️⃣ High-Level Architecture


Seed URLs
→ URL Frontier
→ Scheduler
→ Crawl Workers
→ HTML Parser
→ URL Extractor
→ Deduplication
→ Storage / Index

Main Components

URL Frontier


Scheduler


Crawl Workers


Parser / Extractor


👉 Interview Answer

I would decouple scheduling, fetching, parsing, deduplication, and storage.

This makes the system easier to scale horizontally and isolates failures more effectively.


5️⃣ URL Discovery


URL Sources


URL Extraction Flow

Fetch HTML
→ Parse DOM
→ Extract hyperlinks
→ Normalize URLs
→ Deduplicate
→ Push into frontier

URL Normalization

Examples:

HTTP → HTTPS normalization
Remove fragments
Canonicalize query params
Lowercase hostname

👉 Interview Answer

The crawler discovers URLs recursively.

New URLs must be normalized and deduplicated; otherwise duplicates multiply rapidly.


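6️⃣ URL Frontier and Scheduling


Why Does the Frontier Matter?

It determines:


Scheduling Strategies

FIFO

Simple breadth-first crawl.


Priority-based

Crawl first:


Host-based Scheduling

Avoid overwhelming any single website.


👉 Interview Answer

The URL frontier is the heart of the crawler.

It determines crawl order, enforces fairness, and balances freshness against throughput.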


7️⃣ Fetching Pages


Worker Flow

Get URL from frontier
→ Check robots.txt
→ DNS lookup
→ HTTP fetch
→ Handle redirects
→ Parse response

HTTP Concerns


Retry Strategy

Exponential backoff

👉 Interview Answer

Crawl workers should be as stateless as possible.

They fetch pages asynchronously, handle retries and redirects, and then hand the results to the parser.


8️⃣ robots.txt and Politeness


Why Does It Matter?

Otherwise, the crawler could effectively attack the websites it visits.


robots.txt Example

User-agent: *
Disallow: /private/

Politeness Controls


👉 Interview Answer

A production crawler must respect robots.txt.

I would implement per-domain rate limiting and a robots.txt cache.


9️⃣ Deduplication


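Sources of Duplicates


Dedup Types

URL-level

Avoid fetching the same URL more than once.


Content-level

Detect identical content.


Tools


👉 Interview Answer

Deduplication is critical in a large-scale crawler.

I would combine normalized URLs, Bloom filters, and content hashing.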


🔟 Parsing and Extraction


Extracted Data


Challenges


JS-heavy Pages

Options:

Headless browser
Selective rendering

👉 Interview Answer

Parsing turns HTML into structured data.

For JS-heavy pages, I would render selectively wherever possible rather than running a headless browser on every page.


1️⃣1️⃣ Storage Design


What to Store


Storage Choices

Object Storage

Stores raw HTML.


Metadata DB

Stores crawl state.


Search Index

Supports search.


Graph Storage

Supports link analysis.


👉 Interview Answer

Raw HTML, metadata, the search index, and the link graph should be stored separately.


1️⃣2️⃣ Distributed Scaling


Why Is Distribution Needed?

The web is too large for a single machine.


Scaling Strategies

Distributed Frontier

Sharded URL queues.


Stateless Workers

Make horizontal scaling easy.


Async Fetching

Supports a large number of concurrent requests.


Host Partitioning

Shard by domain.


👉 Interview Answer

A large-scale crawler uses distributed queues, stateless workers, and async fetching.

Host partitioning helps enforce politeness.


1️⃣3️⃣ Freshness and Recrawling


Why Recrawl?

Web pages change.


Factors


Example

News homepage → every minute
Archive page → weekly

👉 Interview Answer

Web crawling is a continuous process, not a one-time task.

Pages that change frequently need to be recrawled more often.


1️⃣4️⃣ Spider Traps and Bad Content


Common Problems


Protection


👉 Interview Answer

Spider traps can send a crawler into an infinite loop.

I would defend against them with URL heuristics, depth limits, and quotas.


1️⃣5️⃣ Search Indexing


Search Flow

Fetched page
→ Parse text
→ Tokenize
→ Build inverted index

Ranking Signals


👉 Interview Answer

Crawled pages typically feed into a search index.

An inverted index supports fast keyword search.


1️⃣6️⃣ Failure Handling


Common Failures


Recovery


👉 Interview Answer

Failures are the normal case.

Persistent queues, retries, and checkpointing help the crawler recover.


1️⃣7️⃣ Observability


Key Metrics


👉 Interview Answer

I would monitor throughput, queue backlog, duplicate rate, retry rate, and freshness.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version for Memorization)

When designing a web crawler, I treat it as a distributed URL discovery, scheduling, fetching, parsing, and indexing pipeline.

Seed URLs enter a distributed URL frontier, and the scheduler decides crawl order and enforces politeness.

Crawl workers fetch pages asynchronously, respect robots.txt, handle retries and redirects, and hand responses to parsers.

Parsers extract links and metadata, normalize URLs, and push newly discovered URLs back into the frontier.

Deduplication is critical at internet scale; I would combine normalized URLs, Bloom filters, and content hashing.

Raw HTML, metadata, and the search index should be stored separately.

The system must recrawl pages continuously, because web content keeps changing.

Distributed queues, stateless workers, async fetching, and host partitioning help the crawler scale horizontally.

Most important of all is politeness. The system must respect robots.txt, enforce per-domain rate limiting, and avoid spider traps.

The core trade-offs are freshness, coverage, throughput, storage cost, and operational complexity.


⭐ Final Insight

The core of a web crawler is not "downloading web pages". It is a large-scale distributed system made up of a URL frontier, a scheduler, distributed fetching, deduplication, parsing, and indexing.
