🎯 Design Web Crawler
1️⃣ Core Framework
When discussing Web Crawler design, I frame it as:
- URL discovery
- URL scheduling
- Fetching web pages
- Parsing and extracting links
- Deduplication
- Storage and indexing
- Crawl politeness and rate limiting
- Distributed scaling and fault tolerance
2️⃣ Core Requirements
Functional Requirements
- Crawl web pages continuously
- Discover new URLs
- Extract links and metadata
- Avoid duplicate crawling
- Respect robots.txt
- Support scheduling and recrawl
- Store crawled content
- Support large-scale distributed crawling
Non-functional Requirements
- High throughput
- Fault tolerance
- Scalable distributed workers
- Low duplicate rate
- Efficient storage usage
- Fairness toward websites
- Eventual consistency acceptable
- Ability to resume after failures
👉 Interview Answer
A web crawler continuously discovers, fetches, parses, and stores web content.
The core challenge is scaling URL discovery and fetching while avoiding duplicate work, respecting politeness rules, and maintaining high throughput.
3️⃣ Main APIs
Submit Seed URLs
POST /api/crawl/seeds
Request:
{
"urls": [
"https://example.com",
"https://news.site.com"
]
}
Query Crawl Status
GET /api/crawl/status?url=https://example.com
Search Indexed Content
GET /api/search?q=distributed+systems
👉 Interview Answer
I would expose APIs for seed submission, crawl monitoring, and indexed content querying.
Internally, the crawler operates as a distributed asynchronous pipeline.
4️⃣ High-Level Architecture
Seed URLs
→ URL Frontier
→ Scheduler
→ Crawl Workers
→ HTML Parser
→ URL Extractor
→ Deduplication
→ Storage / Index
Metadata
→ Crawl Database
robots.txt Cache
→ Politeness Controller
Main Components
URL Frontier
- Stores pending URLs
- Prioritizes crawl order
- Prevents starvation
Scheduler
- Assigns URLs to workers
- Applies politeness rules
- Controls crawl rate
Crawl Workers
- Fetch web pages
- Handle retries
- Parse responses
Parser / Extractor
- Extract links
- Extract metadata
- Normalize URLs
👉 Interview Answer
I would separate URL scheduling, page fetching, parsing, deduplication, and storage into independent components.
This allows the crawler to scale horizontally and isolate failures.
5️⃣ URL Discovery
Sources of URLs
- Seed URLs
- Links extracted from pages
- Sitemaps
- RSS feeds
- User submissions
URL Extraction Flow
Fetch HTML
→ Parse DOM
→ Extract hyperlinks
→ Normalize URLs
→ Deduplicate
→ Push into frontier
URL Normalization
Examples:
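- Scheme normalization (e.g. HTTP → HTTPS, when the site serves the same content on both)
- Remove fragments (#)
- Canonicalize query params (sort them, drop tracking parameters)
- Lowercase hostname
- Remove duplicate slashes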
👉 Interview Answer
The crawler discovers URLs recursively from fetched pages.
Extracted URLs should be normalized and deduplicated before entering the crawl frontier, otherwise duplicates can explode rapidly.
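The normalization rules above can be sketched with the standard library. This is a minimal illustration, not a complete canonicalizer: the `TRACKING_PARAMS` set and the function name `normalize_url` are assumptions made for the example, and the sketch deliberately leaves the scheme alone, since rewriting HTTP to HTTPS is only safe when the site is known to serve the same content on both. It also ignores userinfo and IPv6 literals.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to drop; tune per deployment.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def normalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname is case-insensitive

    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"

    # Collapse runs of slashes and make sure the path is never empty.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"

    # Drop tracking parameters and sort the rest for a stable ordering.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))

    # The empty last element removes the fragment.
    return urlunsplit((scheme, host, path, query, ""))


print(normalize_url("HTTP://Example.COM:80//a//b?b=2&a=1&utm_source=x#frag"))
# prints: http://example.com/a/b?a=1&b=2
```

Two URLs that normalize to the same string can then share one entry in the URL-level dedup set.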
6️⃣ URL Frontier and Scheduling
Why URL Frontier Matters
The frontier controls:
- Crawl order
- Priority
- Freshness
- Fairness
- Throughput
Common Scheduling Strategies
FIFO
Simple breadth-first crawl.
Priority-based
Higher priority for:
- Popular pages
- Fresh content
- High PageRank
- Frequently updated sites
Host-based Scheduling
Avoid overloading a single domain.
Example
example.com → at most one request every 5 seconds
news.com → at most one request every 500 ms
👉 Interview Answer
The URL frontier is the heart of the crawler.
It determines crawl order, enforces fairness, and balances freshness versus throughput.
Host-based scheduling is important to avoid overwhelming websites.
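One way to picture host-based scheduling is a frontier that keeps a FIFO queue per host plus a heap of hosts ordered by the earliest time each may be fetched again. The sketch below is a single-process illustration under that assumption; the class name `HostAwareFrontier` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostAwareFrontier:
    """Per-host FIFO queues plus a heap of hosts keyed by next allowed fetch time."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.delays = {}                  # host -> minimum seconds between fetches
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.next_allowed = {}            # host -> earliest time of the next fetch
        self.ready = []                   # heap of (next_allowed_time, host)
        self.scheduled = set()            # hosts that currently have a heap entry

    def add(self, url, now):
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        if host not in self.scheduled:
            # Never schedule a host earlier than its politeness delay allows.
            when = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready, (when, host))
            self.scheduled.add(host)

    def pop(self, now):
        """Return a URL whose host may be fetched at `now`, or None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delays.get(host, self.default_delay)
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
        return url


frontier = HostAwareFrontier(default_delay=5.0)
frontier.delays["news.com"] = 0.5
frontier.add("https://example.com/a", now=0.0)
frontier.add("https://example.com/b", now=0.0)
frontier.add("https://news.com/1", now=0.0)
print(frontier.pop(now=0.0))  # https://example.com/a
print(frontier.pop(now=0.0))  # https://news.com/1
print(frontier.pop(now=0.0))  # None: example.com must wait until t=5.0
```

A priority-based frontier could layer on top of this by replacing each per-host deque with a priority queue; that extension is not shown here.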
7️⃣ Fetching Pages
Crawl Worker Flow
Get URL from frontier
→ Check robots.txt
→ DNS lookup
→ HTTP fetch
→ Handle redirects
→ Store response
→ Parse content
HTTP Concerns
- Redirects
- Timeouts
- Compression
- Retries
- TLS handling
- User-Agent management
Retry Strategy
Exponential backoff
Retry only transient failures
Limit retry count
👉 Interview Answer
Crawl workers should be lightweight and stateless.
They fetch pages asynchronously, handle retries and redirects, and pass successful responses to downstream parsers.
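The retry strategy above (exponential backoff, transient failures only, bounded retry count) might look like the following. The set of status codes treated as transient and the default limits are assumptions for illustration; real crawlers usually also treat timeouts and connection resets as transient. The jitter variant shown is "full jitter", where the delay is drawn uniformly between zero and the exponential ceiling.

```python
import random

# Assumed set of HTTP status codes worth retrying; adjust per policy.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only transient failures, and only up to max_attempts times."""
    return status in TRANSIENT_STATUS and attempt < max_attempts


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random.random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


for attempt in range(4):
    if should_retry(503, attempt):
        print(f"attempt {attempt}: sleep up to {min(60.0, 2 ** attempt)}s, "
              f"chosen {backoff_delay(attempt):.2f}s")
    else:
        print(f"attempt {attempt}: give up and send to dead-letter handling")
```

Passing `rng` as a parameter keeps the function deterministic under test while still randomizing delays in production.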
8️⃣ robots.txt and Politeness
Why Politeness Matters
Without controls, a crawler can overload websites.
robots.txt Example
User-agent: *
Disallow: /private/
Crawl-delay: 5
Politeness Rules
- Respect robots.txt
- Rate limit per domain
- Limit concurrent requests
- Randomize intervals
- Identify crawler via User-Agent
robots.txt Cache
domain → robots rules + expiration
👉 Interview Answer
A production crawler must respect robots.txt and implement politeness controls.
I would enforce per-domain rate limits and cache robots rules to reduce overhead.
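A robots cache along the lines of "domain → robots rules + expiration" can be sketched with Python's standard `urllib.robotparser`. The `RobotsCache` class, the one-hour TTL, and the injected `fetch_robots` callable are assumptions for the example; `fake_fetch` stands in for a real HTTP request so the snippet runs offline. Note that `Crawl-delay` is not part of the robots.txt standard (RFC 9309) and not every crawler honors it, so treating it as a hint is a policy choice.

```python
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches parsed robots.txt rules per host with a time-to-live."""

    def __init__(self, fetch_robots, ttl_seconds=3600.0):
        # fetch_robots is a hypothetical callable: host -> robots.txt text.
        self.fetch_robots = fetch_robots
        self.ttl_seconds = ttl_seconds
        self.entries = {}  # host -> (parser, expires_at)

    def rules_for(self, host, now):
        cached = self.entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]
        parser = RobotFileParser()
        parser.parse(self.fetch_robots(host).splitlines())
        self.entries[host] = (parser, now + self.ttl_seconds)
        return parser


def fake_fetch(host):
    # Stand-in for a real HTTP fetch of https://<host>/robots.txt.
    return "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"


cache = RobotsCache(fake_fetch)
rules = cache.rules_for("example.com", now=0.0)
print(rules.can_fetch("my-crawler", "https://example.com/private/page"))  # False
print(rules.can_fetch("my-crawler", "https://example.com/public"))        # True
print(rules.crawl_delay("my-crawler"))                                    # 5
```

The crawl delay returned here could feed the per-host delay used by the scheduler.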
9️⃣ Deduplication
Why Duplicates Happen
- Multiple links to same page
- URL aliases
- Tracking parameters
- Redirect chains
- Session IDs
Dedup Layers
URL-level Dedup
Prevent recrawling same normalized URL.
Content-level Dedup
Detect identical pages.
Example:
hash(page_content)
Tools
- Bloom filters
- Distributed hash sets
- Content fingerprinting
- SimHash / MinHash
👉 Interview Answer
Deduplication is critical at internet scale.
I would use normalized URLs, Bloom filters, and content hashing to reduce duplicate crawling and storage.
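To make the two dedup layers concrete, here is a small in-memory Bloom filter for URL-level dedup and an exact content fingerprint for content-level dedup. The sizes (`num_bits`, `num_hashes`) are illustrative assumptions, and the double-hashing scheme derived from one SHA-256 digest is one common construction rather than the only one. A Bloom filter can return false positives (a new URL reported as seen) but never false negatives, so a crawler using it accepts occasionally skipping a page. Exact hashing only catches byte-identical pages; near-duplicates need SimHash or MinHash, which is not shown.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one digest
        # (double hashing: h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate fingerprint of the page body."""
    return hashlib.sha256(body).hexdigest()


seen_urls = BloomFilter()
seen_urls.add("https://example.com/a")
print("https://example.com/a" in seen_urls)  # True
print(content_fingerprint(b"<html>same</html>") == content_fingerprint(b"<html>same</html>"))  # True
```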
🔟 Parsing and Extraction
Extracted Data
- Links
- Title
- Metadata
- Structured data
- Images
- Keywords
Parsing Challenges
- Malformed HTML
- JavaScript-rendered pages
- Infinite calendars
- Duplicate templates
- Dynamic URLs
JavaScript-heavy Pages
Possible solutions:
Headless browser
Render service
Selective JS execution
👉 Interview Answer
Parsing converts fetched pages into structured data.
The crawler should extract links and metadata efficiently, while avoiding expensive full browser rendering whenever possible.
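For static HTML, link and title extraction does not need a browser. The sketch below uses the standard `html.parser` module, which tolerates malformed markup reasonably well; the `LinkExtractor` class and `extract` helper are names invented for this example. It resolves relative links against the page URL, strips fragments, and keeps only HTTP(S) links. Pages that build their links in JavaScript would still need a rendering step.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop the #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(base_url: str, html: str):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.title.strip(), parser.links


page = (
    "<html><head><title> Hi </title></head><body>"
    '<a href="/a#top">A</a><a href="mailto:x@y.z">mail</a>'
    '<a href="https://other.org/b">B</a><p>unclosed</body>'
)
print(extract("https://example.com/dir/page.html", page))
```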
1️⃣1️⃣ Storage Design
What to Store
- Raw HTML
- Parsed metadata
- Link graph
- Crawl metadata
- Content fingerprints
Example Crawl Metadata
CREATE TABLE crawl_metadata (
url VARCHAR,
status_code INT,
crawled_at TIMESTAMP,
content_hash VARCHAR,
response_time_ms INT
);
Storage Types
Object Storage
For raw HTML.
Relational / NoSQL DB
For metadata.
Search Index
For query serving.
Graph Storage
For link graph analysis.
👉 Interview Answer
I would separate raw page storage, metadata storage, and search indexing.
Raw HTML is useful for replay and reprocessing, while metadata powers monitoring and search.
1️⃣2️⃣ Distributed Scaling
Why Distribution is Needed
The web is enormous.
Large crawlers process billions of URLs.
Scaling Strategies
Partition by Host
worker group A → example.com
worker group B → news.com
Distributed Frontier
Sharded URL queues.
Stateless Workers
Easy horizontal scaling.
Async Fetching
Support many concurrent requests.
Queue Technologies
Kafka
Pulsar
SQS
Redis Streams
👉 Interview Answer
A large-scale crawler should use distributed URL queues, stateless workers, and asynchronous fetching.
Partitioning by host helps enforce politeness and improves cache locality.
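Partitioning by host can be as simple as hashing the hostname to a shard, so every URL for a given site lands on the same queue and the same worker group. The function name `shard_for` and the plain modulo scheme are assumptions for illustration. Modulo hashing remaps most hosts when the shard count changes, so a production system might prefer consistent hashing; that is not shown here. A stable hash such as SHA-256 is used because Python's built-in `hash()` on strings varies between processes.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a frontier shard based only on its hostname."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


urls = [
    "https://example.com/a",
    "https://example.com/b?x=1",
    "https://news.com/today",
]
for url in urls:
    print(url, "→ shard", shard_for(url, num_shards=8))
```

Both example.com URLs map to the same shard, which is what lets one worker group own that host's rate limit and robots cache entry.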
1️⃣3️⃣ Freshness and Recrawling
Why Recrawl?
Pages change over time.
Recrawl Strategy
Factors:
- Page popularity
- Historical update frequency
- HTTP cache headers
- Content importance
Example
News homepage → every minute
Blog archive → once a week
Adaptive Crawling
Adjust frequency dynamically.
👉 Interview Answer
Web crawling is continuous, not one-time.
Frequently changing pages should be crawled more often, while static pages can be refreshed less frequently.
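Adaptive crawling can be sketched as a multiplicative adjustment: shorten the recrawl interval when the last fetch found changed content (for example, a different content hash) and lengthen it when nothing changed. The function name, the factor of 2, and the bounds of one minute and one week are assumptions chosen to mirror the examples above, not recommended values.

```python
def next_interval(
    current: float,
    changed: bool,
    min_interval: float = 60.0,              # one minute
    max_interval: float = 7 * 24 * 3600.0,   # one week
    factor: float = 2.0,
) -> float:
    """Halve the interval after a change, double it otherwise, within bounds."""
    proposed = current / factor if changed else current * factor
    return max(min_interval, min(max_interval, proposed))


interval = 3600.0  # start by recrawling hourly
for changed in [True, True, False, False, False]:
    interval = next_interval(interval, changed)
    print(f"changed={changed}: next recrawl in {interval:.0f}s")
```

Signals such as popularity or HTTP cache headers could scale the bounds per page; this sketch uses only observed change history.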
1️⃣4️⃣ Handling Bad Content
Common Problems
- Infinite URL spaces
- Spider traps
- Duplicate calendars
- Very large files
- Malicious responses
Protection Strategies
- Depth limits
- URL pattern filters
- Max response size
- Timeout limits
- MIME-type filtering
- Domain quotas
Spider Trap Example
calendar?page=1
calendar?page=2
calendar?page=3
...
👉 Interview Answer
Crawlers must defend against spider traps and unbounded URL generation.
I would apply URL heuristics, depth limits, and quota controls.
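A few of these protections can be combined into one admission check that runs before a URL enters the frontier. The `TrapGuard` class and all thresholds below are hypothetical; they illustrate depth limits, a URL length cap, a repeated-path-segment heuristic, and a per-host quota. The quota is what eventually stops a `calendar?page=N` style trap, since each such URL looks legitimate on its own.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    def __init__(self, max_depth=10, max_urls_per_host=10000,
                 max_url_length=2000, max_segment_repeats=2):
        self.max_depth = max_depth
        self.max_urls_per_host = max_urls_per_host
        self.max_url_length = max_url_length
        self.max_segment_repeats = max_segment_repeats
        self.per_host = Counter()  # host -> URLs admitted so far

    def allow(self, url: str, depth: int) -> bool:
        if depth > self.max_depth or len(url) > self.max_url_length:
            return False
        parts = urlsplit(url)
        host = parts.hostname or ""
        # Paths like /a/b/a/b/a/b often indicate a link loop.
        segments = [s for s in parts.path.split("/") if s]
        if segments and max(Counter(segments).values()) > self.max_segment_repeats:
            return False
        if self.per_host[host] >= self.max_urls_per_host:
            return False
        self.per_host[host] += 1
        return True


guard = TrapGuard(max_depth=3, max_urls_per_host=2)
for page in (1, 2, 3):
    url = f"https://example.com/calendar?page={page}"
    print(url, guard.allow(url, depth=1))
```

With the quota set to 2 for the demo, the third calendar URL is rejected.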
1️⃣5️⃣ Search Indexing
Search Pipeline
Fetched page
→ Parse text
→ Tokenize
→ Build inverted index
→ Store ranking metadata
Indexed Fields
- Title
- Body text
- Anchor text
- URL
- Metadata
Ranking Signals
- Link graph
- Freshness
- Content relevance
- Page authority
👉 Interview Answer
Crawled pages are typically transformed into a search index.
An inverted index enables fast keyword search, while ranking signals improve result quality.
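The tokenize-then-index step can be shown with a toy in-memory inverted index. The tokenizer (lowercase ASCII alphanumerics) and the AND-only query semantics are simplifying assumptions; the class and function names are invented for the example. Real search indexes add stemming, positions, compression, and ranking, none of which appear here.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str):
    return TOKEN_RE.findall(text.lower())


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query: str):
        """Return ids of documents containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = InvertedIndex()
index.add("https://example.com/a", "Distributed systems and web crawlers")
index.add("https://example.com/b", "Operating systems basics")
print(index.search("distributed systems"))  # {'https://example.com/a'}
print(sorted(index.search("systems")))
```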
1️⃣6️⃣ Failure Handling
Common Failures
- Worker crash
- DNS failure
- Network timeout
- robots.txt fetch failure
- Queue overload
- Duplicate crawling
- Storage outage
Strategies
- Retry with backoff
- Persistent queues
- Checkpointing
- Dead-letter queue
- Crawl resumption
- Idempotent processing
👉 Interview Answer
The crawler should assume failures are normal.
Persistent queues, retries, and checkpointing allow the system to recover gracefully.
1️⃣7️⃣ Consistency Model
Eventual Consistency is Acceptable
Because:
- The web constantly changes
- Some delay is unavoidable
- Full synchronization is impossible
Stronger Consistency Needed For
- Dedup metadata
- URL ownership
- Crawl scheduling state
👉 Interview Answer
Web crawlers are naturally eventually consistent.
Different workers may see different versions of the web, and freshness varies over time.
Strong consistency is mainly needed for scheduling and dedup coordination.
1️⃣8️⃣ Observability
Key Metrics
- URLs crawled per second
- Frontier queue size
- Fetch latency
- Success/error rate
- robots.txt violations
- Duplicate rate
- Storage growth
- Crawl freshness
- Retry rate
- DNS latency
Alerts
- Queue backlog spike
- Worker crash rate
- Excessive retries
- Domain overload
- Storage failures
👉 Interview Answer
I would monitor crawl throughput, queue backlog, fetch latency, duplicate rate, crawl freshness, retry rate, and worker failures.
These metrics show whether the crawler is healthy, efficient, and polite.
1️⃣9️⃣ End-to-End Flow
Crawl Flow
Seed URL submitted
→ URL frontier stores URL
→ Scheduler assigns crawl task
→ Worker fetches page
→ Parser extracts content and links
→ Deduplication filters duplicates
→ New URLs pushed into frontier
→ Content stored and indexed
Recrawl Flow
Page marked stale
→ Scheduler re-enqueues URL
→ Worker refetches content
→ Index updated
Key Insight
A web crawler is fundamentally a distributed URL scheduling and fetching system.
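Stripped of distribution, politeness, and persistence, the crawl flow reduces to a short loop. The sketch below is a single-process illustration only: `crawl`, `fetch`, and `extract_links` are hypothetical names, and the fetcher and extractor are injected so the example can run against an in-memory fake web instead of the network.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: frontier → fetch → extract → dedup → frontier."""
    frontier = deque(seeds)
    seen = set(seeds)          # URL-level dedup
    stored = {}                # url -> fetched body
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        body = fetch(url)
        if body is None:       # fetch failed; a real crawler would retry or dead-letter
            continue
        stored[url] = body
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored


# Fake web: each "page body" is simply its list of outgoing links.
fake_web = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": []}
pages = crawl(
    ["a"],
    fetch=lambda url: fake_web.get(url),
    extract_links=lambda url, body: body,
)
print(list(pages))  # ['a', 'b', 'c']
```

Each component in the notes replaces one line of this loop: the deque becomes the distributed frontier, the set becomes the dedup service, and the dict becomes object storage plus the index.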
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a web crawler, I think of it as a distributed URL discovery, scheduling, fetching, parsing, and indexing pipeline.
Seed URLs enter a distributed URL frontier, where the scheduler prioritizes crawl order and enforces politeness constraints.
Crawl workers fetch pages asynchronously, respect robots.txt, handle retries and redirects, and pass responses to parsers.
Parsers extract links and metadata, normalize URLs, and push newly discovered URLs back into the frontier.
Deduplication is critical at scale, so I would use normalized URLs, Bloom filters, and content hashing to avoid duplicate crawling and storage.
Raw HTML, metadata, and the link graph should be stored separately, because they serve different downstream needs.
The crawler must continuously recrawl important pages, since the web changes constantly.
Distributed queues, stateless workers, asynchronous fetching, and host-based partitioning allow the system to scale horizontally.
Politeness is extremely important. The crawler must respect robots.txt, enforce per-domain rate limits, and avoid spider traps or abusive crawling behavior.
The main trade-offs are crawl freshness, coverage, throughput, storage cost, and operational complexity.
Ultimately, the goal is to discover, fetch, and index web content efficiently and responsibly at internet scale.
⭐ Final Insight
The core of a web crawler is not "downloading web pages"; it is a large-scale distributed system built from a URL frontier, a scheduler, distributed fetching, deduplication, parsing, and indexing.
Chinese Section
🎯 Design Web Crawler
1️⃣ Core Framework
When designing a web crawler, I usually analyze it through the following core modules:
- URL discovery
- URL scheduling
- Page fetching
- Parsing and link extraction
- Deduplication
- Storage and indexing
- Politeness and rate limiting
- Distributed scaling and fault tolerance
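2️⃣ Core Requirements
Functional Requirements
- Crawl web pages continuously
- Discover new URLs automatically
- Extract links and metadata
- Avoid duplicate crawling
- Respect robots.txt
- Support recrawl
- Store crawled content
- Support large-scale distributed crawling
Non-functional Requirements
- High throughput
- Fault tolerance
- Scalable distributed workers
- Low duplicate rate
- Efficient storage usage
- Friendly to websites
- Eventual consistency is sufficient
- Support failure recovery
👉 Interview Answer
A web crawler continuously discovers, fetches, parses, and stores web content.
The core challenge is discovering URLs efficiently at scale, avoiding duplicate crawling, and staying polite.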
3️⃣ Main APIs
Submit Seed URLs
POST /api/crawl/seeds
Query Crawl Status
GET /api/crawl/status?url=https://example.com
Search Indexed Content
GET /api/search?q=distributed+systems
👉 Interview Answer
I would provide APIs for seed submission, crawl monitoring, and indexed content querying.
Internally, the crawler is essentially a distributed async pipeline.
4️⃣ High-Level Architecture
Seed URLs
→ URL Frontier
→ Scheduler
→ Crawl Workers
→ HTML Parser
→ URL Extractor
→ Deduplication
→ Storage / Index
Main Components
URL Frontier
- Stores URLs pending crawl
- Manages crawl priority
- Prevents starvation
Scheduler
- Assigns crawl tasks
- Enforces politeness
- Controls crawl rate
Crawl Workers
- Fetch pages
- Handle retries
- Parse responses
Parser / Extractor
- Extract links
- Extract metadata
- Normalize URLs
👉 Interview Answer
I would decouple scheduling, fetching, parsing, deduplication, and storage.
This makes the system easier to scale horizontally and isolates failures better.
5️⃣ URL Discovery
Sources of URLs
- Seed URLs
- Links in pages
- Sitemaps
- RSS feeds
- User submissions
URL Extraction Flow
Fetch HTML
→ Parse DOM
→ Extract hyperlinks
→ Normalize URLs
→ Deduplicate
→ Push into frontier
URL Normalization
Examples:
HTTP → HTTPS normalization
Remove fragments
Canonicalize query params
Lowercase hostname
👉 Interview Answer
The crawler discovers URLs recursively.
New URLs must be normalized and deduplicated; otherwise duplicates multiply rapidly.
6️⃣ URL Frontier and Scheduling
Why the Frontier Matters
It determines:
- Crawl order
- Priority
- Freshness
- Fairness
- Throughput
Scheduling Strategies
FIFO
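Simple breadth-first crawl.
Priority-based
Prioritize:
- Popular pages
- Frequently updated pages
- High-PageRank pages
Host-based Scheduling
Avoid overwhelming a single website.
👉 Interview Answer
The URL frontier is the core of the crawler.
It determines crawl order, enforces fairness, and balances freshness against throughput.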
7️⃣ Fetching Pages
Worker Flow
Get URL from frontier
→ Check robots.txt
→ DNS lookup
→ HTTP fetch
→ Handle redirects
→ Parse response
HTTP Concerns
- Redirects
- Timeouts
- Compression
- Retries
- TLS handling
Retry Strategy
Exponential backoff
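👉 Interview Answer
Crawl workers should be as stateless as possible.
They fetch pages asynchronously, handle retries and redirects, and then hand the results to the parser.
8️⃣ robots.txt and Politeness
Why It Matters
Otherwise the crawler may effectively attack websites by overloading them.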
robots.txt Example
User-agent: *
Disallow: /private/
Politeness Controls
- Respect robots.txt
- Per-domain rate limit
- Limit concurrent requests
- Randomized intervals
👉 Interview Answer
A production crawler must respect robots.txt.
I would implement per-domain rate limiting and a robots cache.
9️⃣ Deduplication
Sources of Duplicates
- URL aliases
- Tracking params
- Redirect chains
- Session IDs
Dedup Types
URL-level
Avoid crawling the same URL twice.
Content-level
Detect identical content.
Tools
- Bloom filter
- Content hashing
- SimHash
👉 Interview Answer
Dedup is critical in a large-scale crawler.
I would combine normalized URLs, a Bloom filter, and content hashing.
🔟 Parsing and Extraction
Extracted Data
- Links
- Title
- Metadata
- Images
- Structured data
Challenges
- Broken HTML
- JS-rendered pages
- Infinite calendars
- Dynamic URLs
JS-heavy Pages
Options:
Headless browser
Selective rendering
👉 Interview Answer
Parsing turns HTML into structured data.
For JS-heavy pages, I would use selective rendering where possible rather than running a headless browser on every page.
1️⃣1️⃣ Storage Design
What to Store
- Raw HTML
- Metadata
- Link graph
- Crawl metadata
Storage Choices
Object Storage
Stores raw HTML.
Metadata DB
Stores crawl state.
Search Index
Supports search.
Graph Storage
Supports link analysis.
👉 Interview Answer
Raw HTML, metadata, the search index, and the link graph should be stored separately.
1️⃣2️⃣ Distributed Scaling
Why Distribution Is Needed
The web is too large.
Scaling Strategies
Distributed Frontier
Sharded URL queues.
Stateless Workers
Make horizontal scaling easy.
Async Fetching
Supports a large number of concurrent requests.
Host Partitioning
Shard by domain.
👉 Interview Answer
A large-scale crawler uses distributed queues, stateless workers, and async fetching.
Host partitioning helps with politeness.
1️⃣3️⃣ Freshness and Recrawling
Why Recrawl?
Pages change over time.
Factors
- Update frequency
- Popularity
- Importance
Example
News homepage → every minute
Archive page → weekly
👉 Interview Answer
Web crawling is a continuous process, not a one-time task.
Frequently changing pages need more frequent recrawls.
1️⃣4️⃣ Spider Traps and Bad Content
Common Problems
- Infinite URLs
- Calendar loops
- Very large files
- Malicious pages
Protection
- Depth limits
- URL filters
- Max response size
- Domain quotas
👉 Interview Answer
Spider traps can send the crawler into an infinite loop.
I would defend against them with URL heuristics, depth limits, and quotas.
1️⃣5️⃣ Search Indexing
Search Flow
Fetched page
→ Parse text
→ Tokenize
→ Build inverted index
Ranking Signals
- Link graph
- Freshness
- Authority
👉 Interview Answer
Crawled pages typically go into a search index.
An inverted index supports fast keyword search.
1️⃣6️⃣ Failure Handling
Common Failures
- Worker crash
- DNS timeout
- Queue overload
- Storage outage
Recovery
- Retry
- Persistent queue
- Checkpointing
- Crawl resumption
👉 Interview Answer
Failures are the normal case.
Persistent queues, retries, and checkpointing help the crawler recover.
1️⃣7️⃣ Observability
Key Metrics
- URLs crawled/sec
- Queue size
- Fetch latency
- Error rate
- Duplicate rate
- Crawl freshness
👉 Interview Answer
I would monitor throughput, queue backlog, duplicate rate, retry rate, and freshness.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Recitation Version)
When designing a web crawler, I think of it as a distributed URL discovery, scheduling, fetching, parsing, and indexing pipeline.
Seed URLs enter a distributed URL frontier, and the scheduler decides crawl order and enforces politeness.
Crawl workers fetch pages asynchronously, respect robots.txt, handle retries and redirects, and hand responses to parsers.
Parsers extract links and metadata, normalize URLs, and then put newly discovered URLs back into the frontier.
Deduplication is critical at internet scale; I would combine normalized URLs, Bloom filters, and content hashing.
Raw HTML, metadata, and the search index should be stored separately.
The system needs to recrawl pages continuously, because web content keeps changing.
Distributed queues, stateless workers, async fetching, and host partitioning help the crawler scale horizontally.
Politeness matters most. The system must respect robots.txt, implement per-domain rate limiting, and avoid spider traps.
The core trade-offs include freshness, coverage, throughput, storage cost, and operational complexity.
⭐ Final Insight
The core of a web crawler is not "downloading web pages"; it is a large-scale distributed system built from a URL frontier, a scheduler, distributed fetching, deduplication, parsing, and indexing.