d&d-t System Design Deep Dive

🎯 Design Web Crawler

1️⃣ Core Framework

When discussing Web Crawler design, I frame it as:

URL discovery
URL scheduling
Fetching web pages
Parsing and extracting links
Deduplication
Storage and indexing
Crawl politeness and rate limiting
Distributed scaling and fault tolerance

2️⃣ Core Requirements

Functional Requirements

Crawl web pages continuously
Discover new URLs
Extract links and metadata
Avoid duplicate crawling
Respect robots.txt
Support scheduling and recrawl
Store crawled content
Support large-scale distributed crawling

Non-functional Requirements

High throughput
Fault tolerance
Scalable distributed workers
Low duplicate rate
Efficient storage usage
Fairness toward websites
Eventual consistency acceptable
Ability to resume after failures

👉 Interview Answer

A web crawler continuously discovers, fetches, parses, and stores web content.

The core challenge is scaling URL discovery and fetching while avoiding duplicate work, respecting politeness rules, and maintaining high throughput.

3️⃣ Main APIs

Submit Seed URLs

POST /api/crawl/seeds

Request:

{
  "urls": [
    "https://example.com",
    "https://news.site.com"
  ]
}

Query Crawl Status

GET /api/crawl/status?url=https://example.com

Search Indexed Content

GET /api/search?q=distributed+systems

👉 Interview Answer

I would expose APIs for seed submission, crawl monitoring, and indexed content querying.

Internally, the crawler operates as a distributed asynchronous pipeline.

4️⃣ High-Level Architecture

Seed URLs
→ URL Frontier
→ Scheduler
→ Crawl Workers
→ HTML Parser
→ URL Extractor
→ Deduplication
→ Storage / Index

Metadata
→ Crawl Database

robots.txt Cache
→ Politeness Controller

Main Components

URL Frontier

Stores pending URLs
Prioritizes crawl order
Prevents starvation

Scheduler

Assigns URLs to workers
Applies politeness rules
Controls crawl rate

Crawl Workers

Fetch web pages
Handle retries
Parse responses

Parser / Extractor

Extract links
Extract metadata
Normalize URLs

👉 Interview Answer

I would separate URL scheduling, page fetching, parsing, deduplication, and storage into independent components.

This allows the crawler to scale horizontally and isolate failures.

5️⃣ URL Discovery

Sources of URLs

Seed URLs
Links extracted from pages
Sitemaps
RSS feeds
User submissions

URL Extraction Flow

Fetch HTML
→ Parse DOM
→ Extract hyperlinks
→ Normalize URLs
→ Deduplicate
→ Push into frontier
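
The parse and extract steps of the flow above can be sketched with Python's standard library. This is a minimal illustration, not a production parser: the names LinkExtractor and extract_links are made up for this example, and only <a href> links are collected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the fetched page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The extracted links still carry fragments and unnormalized query strings, so they would go through normalization and dedup before reaching the frontier.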

URL Normalization

Examples:

Scheme normalization (e.g., HTTP → HTTPS when the site serves both)
Remove fragments (#)
Canonicalize query params
Lowercase hostname
Remove duplicate slashes
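
A rough sketch of several of these rules in Python follows. The function name normalize_url and the TRACKING_PARAMS set are assumptions made for illustration. The sketch deliberately does not rewrite HTTP to HTTPS, since that is only safe when a site is known to serve both, and it ignores userinfo and IPv6 literals.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to drop; tune per deployment.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    # Collapse duplicate slashes; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Drop tracking parameters and sort the rest for a stable ordering.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((scheme, host, path, query, ""))
```

Sorting query parameters assumes their order carries no meaning for the target site, which holds for most sites but not all.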

👉 Interview Answer

The crawler discovers URLs recursively from fetched pages.

Extracted URLs should be normalized and deduplicated before entering the crawl frontier; otherwise the number of duplicates grows rapidly.

6️⃣ URL Frontier and Scheduling

Why URL Frontier Matters

The frontier controls:

Crawl order
Priority
Freshness
Fairness
Throughput

Common Scheduling Strategies

FIFO

Simple breadth-first crawl.

Priority-based

Higher priority for:

Popular pages
Fresh content
High PageRank
Frequently updated sites

Host-based Scheduling

Avoid overloading a single domain.

Example

example.com → at most one request every 5 seconds
news.com → at most one request every 500 ms
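
One way to sketch host-based scheduling is a per-host FIFO queue plus a min-heap keyed by the next time each host may be fetched. The class name HostFrontier and its fields are assumptions for illustration; the clock is passed in explicitly so the behavior is easy to test, and priorities are left out.

```python
import heapq
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlsplit


class HostFrontier:
    """Per-host FIFO queues plus a min-heap of (next_allowed_time, host)."""

    def __init__(self, default_delay: float = 1.0) -> None:
        self.default_delay = default_delay
        self.delays = {}                  # host -> seconds between requests
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.next_allowed = {}            # host -> earliest next fetch time
        self.heap = []                    # (time, host) for hosts with pending work
        self.on_heap = set()

    def add(self, url: str, now: float) -> None:
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        if host not in self.on_heap:
            # Never schedule a host earlier than its politeness window allows.
            ready_at = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.heap, (ready_at, host))
            self.on_heap.add(host)

    def pop(self, now: float) -> Optional[str]:
        """Return a URL whose host may be fetched at `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delays.get(host, self.default_delay)
        if self.queues[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        else:
            self.on_heap.discard(host)
        return url
```

With delays set to 5 seconds for example.com, two URLs for that host added at time 0 come out at times 0 and 5, while URLs for other hosts are served in between.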

👉 Interview Answer

The URL frontier is the heart of the crawler.

It determines crawl order, enforces fairness, and balances freshness versus throughput.

Host-based scheduling is important to avoid overwhelming websites.

7️⃣ Fetching Pages

Crawl Worker Flow

Get URL from frontier
→ Check robots.txt
→ DNS lookup
→ HTTP fetch
→ Handle redirects
→ Store response
→ Parse content

HTTP Concerns

Redirects
Timeouts
Compression
Retries
TLS handling
User-Agent management

Retry Strategy

Exponential backoff
Retry only transient failures
Limit retry count
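
The three rules above can be sketched as follows. TransientError, backoff_delays, and fetch_with_retry are hypothetical names, the default base and cap values are arbitrary, and the random jitter is an addition beyond what the notes list.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, ...)."""


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """The ceiling before retry i is min(cap, base * 2**i)."""
    return [min(cap, base * (2 ** i)) for i in range(max_retries)]


def fetch_with_retry(
    fetch: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget exhausted
            # Jitter: sleep a random amount up to the backoff ceiling.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

Any exception other than TransientError propagates on the first attempt, which is how "retry only transient failures" is expressed here.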

👉 Interview Answer

Crawl workers should be lightweight and stateless.

They fetch pages asynchronously, handle retries and redirects, and pass successful responses to downstream parsers.

8️⃣ robots.txt and Politeness

Why Politeness Matters

Without controls, a crawler can overload websites.

robots.txt Example

User-agent: *
Disallow: /private/
Crawl-delay: 5

Politeness Rules

Respect robots.txt
Rate limit per domain
Limit concurrent requests
Randomize intervals
Identify crawler via User-Agent

robots.txt Cache

domain → robots rules + expiration
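
A minimal sketch of such a cache, built on the standard library's urllib.robotparser, might look like this. RobotsCache and its one-hour default TTL are assumptions for illustration; fetching robots.txt over the network is left to the caller.

```python
from typing import Dict, Optional, Tuple
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Maps domain -> (parsed rules, expiry time)."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[str, Tuple[RobotFileParser, float]] = {}

    def put(self, domain: str, robots_txt: str, now: float) -> RobotFileParser:
        rules = RobotFileParser()
        rules.parse(robots_txt.splitlines())
        self._entries[domain] = (rules, now + self.ttl_seconds)
        return rules

    def get(self, domain: str, now: float) -> Optional[RobotFileParser]:
        entry = self._entries.get(domain)
        if entry is None or entry[1] <= now:
            return None  # missing or expired: caller should refetch robots.txt
        return entry[0]
```

For the example rules above, rules.can_fetch("MyCrawler", "https://example.com/private/x") returns False and rules.crawl_delay("MyCrawler") returns 5, where "MyCrawler" is a placeholder user agent.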

👉 Interview Answer

A production crawler must respect robots.txt and implement politeness controls.

I would enforce per-domain rate limits and cache robots rules to reduce overhead.

9️⃣ Deduplication

Why Duplicates Happen

Multiple links to same page
URL aliases
Tracking parameters
Redirect chains
Session IDs

Dedup Layers

URL-level Dedup

Prevent recrawling same normalized URL.

Content-level Dedup

Detect identical pages.

Example:

hash(page_content)

Tools

Bloom filters
Distributed hash sets
Content fingerprinting
SimHash / MinHash
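
To make the first and third tools concrete, here is a small in-memory Bloom filter for URL-level dedup and an exact content fingerprint. The sizes and the double-hashing scheme are illustrative assumptions. An exact hash only catches byte-identical pages; near-duplicates need SimHash or MinHash, which this sketch does not cover.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate fingerprint of a page body."""
    return hashlib.sha256(body).hexdigest()
```

A false positive means a new URL is wrongly treated as already seen and skipped, so the filter should be sized for the expected URL count.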

👉 Interview Answer

Deduplication is critical at internet scale.

I would use normalized URLs, Bloom filters, and content hashing to reduce duplicate crawling and storage.

🔟 Parsing and Extraction

Extracted Data

Links
Title
Metadata
Structured data
Images
Keywords

Parsing Challenges

Malformed HTML
JavaScript-rendered pages
Infinite calendars
Duplicate templates
Dynamic URLs

JavaScript-heavy Pages

Possible solutions:

Headless browser
Render service
Selective JS execution

👉 Interview Answer

Parsing converts fetched pages into structured data.

The crawler should extract links and metadata efficiently, while avoiding expensive full browser rendering whenever possible.

1️⃣1️⃣ Storage Design

What to Store

Raw HTML
Parsed metadata
Link graph
Crawl metadata
Content fingerprints

Example Crawl Metadata

crawl_metadata (
  url VARCHAR,
  status_code INT,
  crawled_at TIMESTAMP,
  content_hash VARCHAR,
  response_time_ms INT
)
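
As a runnable illustration of this table, the sketch below uses an in-memory SQLite database. Making url the primary key and storing crawled_at as ISO-8601 text are assumptions, not part of the schema above; record_crawl is a hypothetical helper.

```python
import sqlite3


def open_metadata_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE crawl_metadata (
            url TEXT PRIMARY KEY,
            status_code INTEGER,
            crawled_at TEXT,
            content_hash TEXT,
            response_time_ms INTEGER
        )
        """
    )
    return conn


def record_crawl(conn, url, status_code, crawled_at, content_hash, response_time_ms):
    # Keyed on URL, so a recrawl replaces the previous row instead of adding one.
    conn.execute(
        "INSERT OR REPLACE INTO crawl_metadata VALUES (?, ?, ?, ?, ?)",
        (url, status_code, crawled_at, content_hash, response_time_ms),
    )
    conn.commit()
```

Keying on the normalized URL keeps writes idempotent, which helps when a worker retries after a crash.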

Storage Types

Object Storage

For raw HTML.

Relational / NoSQL DB

For metadata.

Search Index

For query serving.

Graph Storage

For link graph analysis.

👉 Interview Answer

I would separate raw page storage, metadata storage, and search indexing.

Raw HTML is useful for replay and reprocessing, while metadata powers monitoring and search.

1️⃣2️⃣ Distributed Scaling

Why Distribution is Needed

The web is enormous.

Large crawlers process billions of URLs.

Scaling Strategies

Partition by Host

worker group A → example.com
worker group B → news.com
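
A simple way to get this mapping is to hash the hostname into a fixed number of shards, as sketched below. The function name shard_for_url is an assumption, and plain modulo hashing reshuffles most hosts when the shard count changes; consistent hashing would be one alternative.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for_url(url: str, num_shards: int) -> int:
    """Map every URL of a host to the same shard."""
    host = (urlsplit(url).hostname or "").lower()
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because all URLs of a host land on one shard, that shard alone can enforce the host's rate limit and hold its robots.txt rules.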

Distributed Frontier

Sharded URL queues.

Stateless Workers

Easy horizontal scaling.

Async Fetching

Support many concurrent requests.
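
The idea can be sketched with asyncio and a semaphore that bounds in-flight requests. Here fake_fetch is a stand-in that performs no network I/O, and crawl_batch is a hypothetical helper; a real worker would plug in an async HTTP client.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request; no network is involved.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl_batch(urls, fetch=fake_fetch, max_concurrency: int = 100) -> dict:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str):
        # At most max_concurrency fetches run at the same time.
        async with semaphore:
            return url, await fetch(url)

    results = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(results)
```

A call such as asyncio.run(crawl_batch(urls, max_concurrency=5)) returns a mapping from each URL to its fetched body.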

Queue Technologies

Kafka
Pulsar
SQS
Redis Streams

👉 Interview Answer

A large-scale crawler should use distributed URL queues, stateless workers, and asynchronous fetching.

Partitioning by host helps enforce politeness and improves cache locality for DNS results, robots.txt rules, and connections.

1️⃣3️⃣ Freshness and Recrawling

Why Recrawl?

Pages change over time.

Recrawl Strategy

Factors:

Page popularity
Historical update frequency
HTTP cache headers
Content importance

Example

News homepage → every minute
Blog archive → once a week

Adaptive Crawling

Adjust frequency dynamically.
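
One simple adaptive policy, shown here purely as an assumption, halves the recrawl interval when a page changed since the last visit and doubles it when it did not, within fixed bounds. The bounds of one minute and one week mirror the example above.

```python
def next_interval(
    current: float,
    changed: bool,
    min_interval: float = 60.0,
    max_interval: float = 7 * 24 * 3600.0,
) -> float:
    """Return the next recrawl interval in seconds."""
    # Changed pages are revisited sooner, unchanged pages later.
    proposed = current / 2 if changed else current * 2
    return max(min_interval, min(max_interval, proposed))
```

Whether a page "changed" can be decided by comparing the stored content hash with the hash of the new fetch.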

👉 Interview Answer

Web crawling is continuous, not one-time.

Frequently changing pages should be crawled more often, while static pages can be refreshed less frequently.

1️⃣4️⃣ Handling Bad Content

Common Problems

Infinite URL spaces
Spider traps
Duplicate calendars
Very large files
Malicious responses

Protection Strategies

Depth limits
URL pattern filters
Max response size
Timeout limits
MIME-type filtering
Domain quotas

Spider Trap Example

calendar?page=1
calendar?page=2
calendar?page=3
...
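
A few of the protection strategies can be combined into one admission check before a URL enters the frontier. CrawlGuard and its default limits are illustrative assumptions, not recommended values.

```python
from collections import Counter
from urllib.parse import urlsplit


class CrawlGuard:
    """Rejects URLs that exceed depth, length, or per-host quota limits."""

    def __init__(
        self,
        max_depth: int = 10,
        max_url_length: int = 2048,
        max_urls_per_host: int = 10000,
    ) -> None:
        self.max_depth = max_depth
        self.max_url_length = max_url_length
        self.max_urls_per_host = max_urls_per_host
        self.per_host = Counter()

    def allow(self, url: str, depth: int) -> bool:
        if depth > self.max_depth or len(url) > self.max_url_length:
            return False
        host = urlsplit(url).hostname or ""
        if self.per_host[host] >= self.max_urls_per_host:
            return False  # domain quota exhausted
        self.per_host[host] += 1
        return True
```

With a quota of 2 for a host, calendar?page=1 and calendar?page=2 are admitted and calendar?page=3 is rejected, which bounds the damage of an endless pagination trap.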

👉 Interview Answer

Crawlers must defend against spider traps and unbounded URL generation.

I would apply URL heuristics, depth limits, and quota controls.

1️⃣5️⃣ Search Indexing

Search Pipeline

Fetched page
→ Parse text
→ Tokenize
→ Build inverted index
→ Store ranking metadata
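
The tokenize and index steps can be illustrated with a toy in-memory inverted index. The tokenizer is a deliberately naive assumption (lowercase ASCII letters and digits only), queries are treated as AND of all terms, and ranking is left out.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each token to the set of document IDs that contain it."""

    def __init__(self) -> None:
        self.postings = defaultdict(set)

    def add(self, doc_id, text: str) -> None:
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> set:
        tokens = tokenize(query)
        if not tokens:
            return set()
        # Intersect posting sets: a document must contain every query term.
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result
```

A production index would store postings on disk in sorted, compressed form and attach ranking metadata to each entry.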

Indexed Fields

Title
Body text
Anchor text
URL
Metadata

Ranking Signals

Link graph
Freshness
Content relevance
Page authority

👉 Interview Answer

Crawled pages are typically transformed into a search index.

An inverted index enables fast keyword search, while ranking signals improve result quality.

1️⃣6️⃣ Failure Handling

Common Failures

Worker crash
DNS failure
Network timeout
robots.txt fetch failure
Queue overload
Duplicate crawling
Storage outage

Strategies

Retry with backoff
Persistent queues
Checkpointing
Dead-letter queue
Crawl resumption
Idempotent processing

👉 Interview Answer

The crawler should assume failures are normal.

Persistent queues, retries, and checkpointing allow the system to recover gracefully.

1️⃣7️⃣ Consistency Model

Eventual Consistency is Acceptable

Because:

The web constantly changes
Some delay is unavoidable
Full synchronization is impossible

Stronger Consistency Needed For

Dedup metadata
URL ownership
Crawl scheduling state

👉 Interview Answer

Web crawlers are naturally eventually consistent.

Different workers may see different versions of the web, and freshness varies over time.

Strong consistency is mainly needed for scheduling and dedup coordination.

1️⃣8️⃣ Observability

Key Metrics

URLs crawled per second
Frontier queue size
Fetch latency
Success/error rate
robots.txt violations
Duplicate rate
Storage growth
Crawl freshness
Retry rate
DNS latency

Alerts

Queue backlog spike
Worker crash rate
Excessive retries
Domain overload
Storage failures

👉 Interview Answer

I would monitor crawl throughput, queue backlog, fetch latency, duplicate rate, crawl freshness, retry rate, and worker failures.

These metrics show whether the crawler is healthy, efficient, and polite.

1️⃣9️⃣ End-to-End Flow

Crawl Flow

Seed URL submitted
→ URL frontier stores URL
→ Scheduler assigns crawl task
→ Worker fetches page
→ Parser extracts content and links
→ Deduplication filters duplicates
→ New URLs pushed into frontier
→ Content stored and indexed
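
The crawl flow can be condensed into a single-process toy loop that makes the data flow visible. Everything here is a simplification under stated assumptions: the fetch function is injected, links are found with a naive regular expression, the frontier is a plain FIFO queue, and politeness, robots.txt, and indexing are omitted.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(seeds, fetch, max_pages: int = 100) -> dict:
    """Breadth-first crawl; `fetch(url)` returns HTML or None on failure."""
    frontier = deque(seeds)
    seen = set(seeds)   # URL-level dedup
    store = {}          # url -> raw HTML
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        for href in re.findall(r'href="([^"]+)"', html):
            # Resolve relative links and drop fragments before dedup.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store
```

Passing a dictionary's get method as fetch is enough to exercise the loop against a small fake site.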

Recrawl Flow

Page marked stale
→ Scheduler re-enqueues URL
→ Worker refetches content
→ Index updated

Key Insight

A web crawler is fundamentally a distributed URL scheduling and fetching system.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a web crawler, I think of it as a distributed URL discovery, scheduling, fetching, parsing, and indexing pipeline.

Seed URLs enter a distributed URL frontier, where the scheduler prioritizes crawl order and enforces politeness constraints.

Crawl workers fetch pages asynchronously, respect robots.txt, handle retries and redirects, and pass responses to parsers.

Parsers extract links and metadata, normalize URLs, and push newly discovered URLs back into the frontier.

Deduplication is critical at scale, so I would use normalized URLs, Bloom filters, and content hashing to avoid duplicate crawling and storage.

Raw HTML, metadata, and the link graph should be stored separately, because they serve different downstream needs.

The crawler must continuously recrawl important pages, since the web changes constantly.

Distributed queues, stateless workers, asynchronous fetching, and host-based partitioning allow the system to scale horizontally.

Politeness is extremely important. The crawler must respect robots.txt, enforce per-domain rate limits, and avoid spider traps or abusive crawling behavior.

The main trade-offs are crawl freshness, coverage, throughput, storage cost, and operational complexity.

Ultimately, the goal is to discover, fetch, and index web content efficiently and responsibly at internet scale.

⭐ Final Insight

The core of a web crawler is not "downloading web pages"; it is a large-scale distributed system made up of a URL frontier, a scheduler, distributed fetching, deduplication, parsing, and indexing.

Chinese Section (English translation)

🎯 Design Web Crawler

1️⃣ Core Framework

When designing a web crawler, I usually analyze it through the following core modules:

URL discovery
URL scheduling
Page fetching
Parsing and link extraction
Deduplication
Storage and indexing
Politeness and rate limiting
Distributed scaling and fault tolerance

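2️⃣ Core Requirements

Functional Requirements

Crawl web pages continuously
Automatically discover new URLs
Extract links and metadata
Avoid duplicate crawling
Respect robots.txt
Support recrawl
Store crawled content
Support large-scale distributed crawling

Non-functional Requirements

High throughput
Fault tolerance
Scalable distributed workers
Low duplicate rate
Efficient storage usage
Friendly toward websites
Eventual consistency is acceptable
Support failure recovery

👉 Interview Answer

A web crawler continuously discovers, fetches, parses, and stores web content.

The core challenge is discovering URLs efficiently at scale, avoiding duplicate crawling, and keeping the crawl polite.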

3️⃣ Main APIs

Submit Seed URLs

POST /api/crawl/seeds

Query Crawl Status

GET /api/crawl/status?url=https://example.com

Search Indexed Content

GET /api/search?q=distributed+systems

👉 Interview Answer

I would provide APIs for seed submission, crawl monitoring, and indexed content querying.

Internally, the crawler is essentially a distributed async pipeline.

4️⃣ High-Level Architecture

Seed URLs
→ URL Frontier
→ Scheduler
→ Crawl Workers
→ HTML Parser
→ URL Extractor
→ Deduplication
→ Storage / Index

Main Components

URL Frontier

Stores URLs waiting to be crawled
Manages crawl priority
Prevents starvation

Scheduler

Assigns crawl tasks
Enforces politeness
Controls crawl rate

Crawl Workers

Fetch pages
Handle retries
Parse responses

Parser / Extractor

Extract links
Extract metadata
Normalize URLs

👉 Interview Answer

I would decouple scheduling, fetching, parsing, deduplication, and storage.

This makes the system easier to scale horizontally and better at isolating failures.

5️⃣ URL Discovery

Sources of URLs

Seed URLs
Links in pages
Sitemaps
RSS feeds
User submissions

URL Extraction Flow

Fetch HTML
→ Parse DOM
→ Extract hyperlinks
→ Normalize URLs
→ Deduplicate
→ Push into frontier

URL Normalization

For example:

Scheme normalization (e.g., HTTP → HTTPS when the site serves both)
Remove fragments
Canonicalize query params
Lowercase hostname

👉 Interview Answer

The crawler discovers URLs recursively.

New URLs must be normalized and deduplicated; otherwise duplicates grow exponentially.

6️⃣ URL Frontier and Scheduling

Why does the frontier matter?

It determines:

Crawl order
Priority
Freshness
Fairness
Throughput

Scheduling Strategies

FIFO

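Simple breadth-first crawl.

Priority-based

Prioritize:

Popular pages
Frequently updated pages
High-PageRank pages

Host-based Scheduling

Avoid overwhelming a single website.

👉 Interview Answer

The URL frontier is the heart of the crawler.

It determines crawl order, enforces fairness, and balances freshness against throughput.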

7️⃣ Fetching Pages

Worker Flow

Get URL from frontier
→ Check robots.txt
→ DNS lookup
→ HTTP fetch
→ Handle redirects
→ Parse response

HTTP Concerns

Redirects
Timeouts
Compression
Retries
TLS handling

Retry Strategy

Exponential backoff

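👉 Interview Answer

Crawl workers should be as stateless as possible.

They fetch pages asynchronously, handle retries and redirects, and then hand the results to the parser.

8️⃣ robots.txt and Politeness

Why does it matter?

Without it, the crawler's traffic can amount to an attack on a website.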

robots.txt Example

User-agent: *
Disallow: /private/

Politeness Controls

Respect robots.txt
Per-domain rate limit
Limit concurrent requests
Randomized intervals

👉 Interview Answer

A production crawler must respect robots.txt.

I would implement per-domain rate limiting and a robots.txt cache.

9️⃣ Deduplication

Sources of Duplicates

URL aliases
Tracking params
Redirect chains
Session IDs

Dedup Types

URL-level

Avoid recrawling the same URL.

Content-level

Detect identical content.

Tools

Bloom filter
Content hashing
SimHash

👉 Interview Answer

Dedup is critical in a large-scale crawler.

I would combine normalized URLs, a Bloom filter, and content hashing.

🔟 Parsing and Extraction

Extracted Content

Links
Title
Metadata
Images
Structured data

Challenges

Broken HTML
JS-rendered pages
Infinite calendars
Dynamic URLs

JS-heavy Pages

Options:

Headless browser
Selective rendering

👉 Interview Answer

Parsing turns HTML into structured data.

For JS-heavy pages, I would render selectively where possible rather than running a headless browser on every page.

1️⃣1️⃣ Storage Design

What to Store

Raw HTML
Metadata
Link graph
Crawl metadata

Storage Choices

Object Storage

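Stores raw HTML.

Metadata DB

Stores crawl state.

Search Index

Supports search.

Graph Storage

Supports link analysis.

👉 Interview Answer

Raw HTML, metadata, the search index, and the link graph should be stored separately.

1️⃣2️⃣ Distributed Scaling

Why is distribution needed?

The web is too large for a single machine to crawl.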

Scaling Strategies

Distributed Frontier

Sharded URL queues.

Stateless Workers

Easy horizontal scaling.

Async Fetching

Supports many concurrent requests.

Host Partitioning

Shard by domain.

👉 Interview Answer

A large-scale crawler uses distributed queues, stateless workers, and async fetching.

Host partitioning helps enforce politeness.

1️⃣3️⃣ Freshness and Recrawling

Why recrawl?

Pages change over time.

Factors

Update frequency
Popularity
Importance

Example

News homepage → every minute
Archive page → weekly

👉 Interview Answer

Web crawling is a continuous process, not a one-time task.

Frequently changing pages need more frequent recrawls.

1️⃣4️⃣ Spider Traps and Bad Content

Common Problems

Infinite URLs
Calendar loops
Very large files
Malicious pages

Protection

Depth limits
URL filters
Max response size
Domain quotas

👉 Interview Answer

Spider traps can send the crawler into an infinite loop.

I would defend against them with URL heuristics, depth limits, and quotas.

1️⃣5️⃣ Search Indexing

Search Flow

Fetched page
→ Parse text
→ Tokenize
→ Build inverted index

Ranking Signals

Link graph
Freshness
Authority

👉 Interview Answer

Crawled pages typically flow into a search index.

An inverted index supports fast keyword search.

1️⃣6️⃣ Failure Handling

Common Failures

Worker crash
DNS timeout
Queue overload
Storage outage

Recovery

Retry
Persistent queue
Checkpointing
Crawl resumption

👉 Interview Answer

Failures are the normal case.

Persistent queues, retries, and checkpointing help the crawler recover.

1️⃣7️⃣ Observability

Key Metrics

URLs crawled/sec
Queue size
Fetch latency
Error rate
Duplicate rate
Crawl freshness

👉 Interview Answer

I would monitor throughput, queue backlog, duplicate rate, retry rate, and freshness.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version for Memorization)

When designing a web crawler, I think of it as a distributed URL discovery, scheduling, fetching, parsing, and indexing pipeline.

Seed URLs enter a distributed URL frontier, and the scheduler decides crawl order and enforces politeness.

Crawl workers fetch pages asynchronously, respect robots.txt, handle retries and redirects, and pass responses to parsers.

Parsers extract links and metadata, normalize URLs, and push newly discovered URLs back into the frontier.

Deduplication is critical at internet scale, so I would combine normalized URLs, Bloom filters, and content hashing.

Raw HTML, metadata, and the search index should be stored separately.

The system needs to recrawl pages continuously, because web content keeps changing.

Distributed queues, stateless workers, async fetching, and host partitioning help the crawler scale horizontally.

Most important of all is politeness. The system must respect robots.txt, implement per-domain rate limiting, and avoid spider traps.

The core trade-offs include freshness, coverage, throughput, storage cost, and operational complexity.

⭐ Final Insight

The core of a web crawler is not "downloading web pages"; it is a large-scale distributed system made up of a URL frontier, a scheduler, distributed fetching, deduplication, parsing, and indexing.