🎯 How Twitter Handles Trending Topics
1️⃣ Core Trending Framework (Staff-Level)
When discussing a Twitter-like trending topics system, I frame it as:
- Tweet/event stream ingestion
- Entity and hashtag extraction
- Windowed counting
- Baseline comparison
- Spike detection
- Spam and manipulation filtering
- Regional and personalized ranking
- Trade-offs: freshness vs stability vs abuse resistance
2️⃣ Core Problem
Trending is not the same as popular.
A topic should be surfaced when it is:
- growing unusually fast
- relevant to a region or audience
- not mostly spam
- not a duplicate of another topic
- fresh enough to be interesting
👉 Interview Answer
A Twitter-like trending system is a realtime anomaly detection system over social activity streams. It detects unusual growth compared with historical baselines, then filters spam and ranks topics by relevance.
3️⃣ High-Level Architecture
Tweet Stream
↓
Entity / Hashtag Extraction
↓
Stream Aggregation
↓
Sliding Window Counters
↓
Baseline Comparison
↓
Spam and Quality Filters
↓
Regional Trend Ranking
↓
Trending Surface
4️⃣ Entity Extraction
Extract candidates:
- hashtags
- named entities
- phrases
- URLs
- events
- people or organizations
Normalize:
- case folding
- spelling variants
- duplicate hashtags
- entity linking
- language detection
5️⃣ Windowed Counting
Use multiple windows:
- 1 minute
- 5 minutes
- 15 minutes
- 1 hour
Track:
- mention count
- unique authors
- retweet ratio
- geographic distribution
- velocity
- acceleration
👉 Interview Answer
Sliding windows let the system measure recent activity and growth rate. But raw count is not enough, because a topic with consistently high volume may be popular but not newly trending.
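The counting idea above can be sketched with fixed-size time buckets, so that one structure answers the 1-minute, 5-minute, 15-minute, and 1-hour queries. This is an in-memory illustration only: the class name `SlidingWindowCounter` and its parameters are assumptions, events are assumed to arrive in roughly increasing timestamp order per topic, and a production system would run this inside a distributed stream processor.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Per-topic counts in fixed time buckets, queried over sliding windows."""

    def __init__(self, bucket_seconds: int = 60, max_window_seconds: int = 3600):
        self.bucket_seconds = bucket_seconds
        self.max_buckets = max_window_seconds // bucket_seconds
        # topic -> deque of [bucket_index, count], oldest bucket first
        self.buckets = defaultdict(deque)

    def add(self, topic: str, timestamp: float, n: int = 1) -> None:
        idx = int(timestamp // self.bucket_seconds)
        dq = self.buckets[topic]
        if dq and dq[-1][0] == idx:
            dq[-1][1] += n  # same bucket as the latest event
        else:
            dq.append([idx, n])
        # Evict buckets that fall outside the longest supported window.
        while dq and dq[0][0] <= idx - self.max_buckets:
            dq.popleft()

    def count(self, topic: str, now: float, window_seconds: int) -> int:
        now_idx = int(now // self.bucket_seconds)
        oldest = now_idx - window_seconds // self.bucket_seconds + 1
        return sum(c for i, c in self.buckets.get(topic, ()) if oldest <= i <= now_idx)

    def velocity(self, topic: str, now: float, window_seconds: int) -> float:
        """Mentions per minute over the window ending at `now`."""
        return self.count(topic, now, window_seconds) / (window_seconds / 60)

    def acceleration(self, topic: str, now: float, window_seconds: int) -> float:
        """Change in velocity between this window and the one before it."""
        previous = self.velocity(topic, now - window_seconds, window_seconds)
        return self.velocity(topic, now, window_seconds) - previous
```

The bucket size bounds both memory and precision: with 60-second buckets, the 1-minute window is really "the current calendar minute", which is usually acceptable for trend detection but is an approximation.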
6️⃣ Baseline and Spike Detection
Compare current activity to:
- same topic historical baseline
- region baseline
- time-of-day baseline
- language baseline
- global popularity baseline
Example:
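trend_score = current_velocity / expected_velocity
A score near 1 means the topic is behaving as expected; a score well above 1 means it is growing faster than its baseline predicts.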
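A small sketch of the ratio above. Two things here are assumptions added for illustration, not part of the notes: the additive `smoothing` term, which keeps brand-new topics with a near-zero baseline from dividing by zero or scoring infinitely high, and the exponentially weighted moving average used to maintain the baseline.

```python
def trend_score(current_velocity: float, expected_velocity: float,
                smoothing: float = 1.0) -> float:
    """Ratio of observed to expected velocity, with additive smoothing.

    `smoothing` is a hypothetical tuning knob: larger values damp the
    score of topics whose baseline is tiny.
    """
    return (current_velocity + smoothing) / (expected_velocity + smoothing)


def update_baseline(previous_baseline: float, observed_velocity: float,
                    alpha: float = 0.1) -> float:
    """Exponentially weighted moving average as one possible baseline model.

    Assumption: a single EWMA per topic. Time-of-day, regional, or language
    baselines would each need their own series.
    """
    return alpha * observed_velocity + (1 - alpha) * previous_baseline
```

For example, an established topic at 1000 mentions per minute against a baseline of 900 scores about 1.1, while a new topic at 50 against a baseline of 2 scores 17, which matches the point that trending is about unusual growth rather than absolute volume.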
7️⃣ Spam and Manipulation Filtering
Abuse signals:
- many new accounts
- repeated identical text
- bot-like timing
- low reputation authors
- coordinated retweet clusters
- suspicious geographic concentration
👉 Interview Answer
Trend ranking must include abuse resistance. Without spam filtering, coordinated actors can manufacture trends by creating artificial volume.
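A few of the abuse signals listed above can be computed directly from the mentions behind a candidate topic. The sketch below is illustrative only: the `Mention` fields, the 7-day "new account" cutoff, and every threshold in `looks_manipulated` are assumptions, and a real system would more likely feed such signals into a learned classifier than into fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    author_id: str
    text: str
    account_age_days: int


def abuse_signals(mentions: list[Mention], new_account_days: int = 7) -> dict[str, float]:
    """Compute simple per-topic ratios from the mentions in a window."""
    total = len(mentions)
    if total == 0:
        return {"unique_author_ratio": 0.0,
                "duplicate_text_ratio": 0.0,
                "new_account_ratio": 0.0}
    unique_authors = len({m.author_id for m in mentions})
    # Treat texts as identical after trimming whitespace and folding case.
    unique_texts = len({m.text.strip().casefold() for m in mentions})
    new_accounts = sum(1 for m in mentions if m.account_age_days < new_account_days)
    return {
        "unique_author_ratio": unique_authors / total,
        "duplicate_text_ratio": 1 - unique_texts / total,
        "new_account_ratio": new_accounts / total,
    }


def looks_manipulated(signals: dict[str, float],
                      max_duplicate: float = 0.5,
                      max_new: float = 0.5,
                      min_unique_authors: float = 0.3) -> bool:
    """Hypothetical rule: flag the topic if any one signal crosses its threshold."""
    return (signals["duplicate_text_ratio"] > max_duplicate
            or signals["new_account_ratio"] > max_new
            or signals["unique_author_ratio"] < min_unique_authors)
```

Counting unique authors rather than raw mentions is the cheapest defense here: a handful of accounts posting the same hashtag thousands of times raises mention count but not author count.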
8️⃣ Regionalization
Trends differ by:
- country
- city
- language
- user interests
- follow graph
Regionalization prevents global topics from drowning out local events.
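One way to picture regionalization is to keep trend scores keyed by (region, topic) and rank within each region separately. The function name `top_trends_by_region` and the shape of the `scores` mapping are assumptions for illustration; personalization by interests or follow graph is not covered.

```python
import heapq
from collections import defaultdict


def top_trends_by_region(scores: dict[tuple[str, str], float],
                         k: int = 3) -> dict[str, list[str]]:
    """Return the k highest-scoring topics for each region.

    `scores` maps (region, topic) to a trend score computed against that
    region's own baseline, so a local event competes only with other
    topics in the same region.
    """
    by_region = defaultdict(list)
    for (region, topic), score in scores.items():
        by_region[region].append((score, topic))
    return {
        region: [topic for _, topic in heapq.nlargest(k, items)]
        for region, items in by_region.items()
    }
```

With scores such as `{("US", "#superbowl"): 8.0, ("US", "#worldcup"): 3.0, ("BR", "#worldcup"): 9.0, ("BR", "#carnaval"): 12.0}` and `k=1`, each region surfaces its own top topic instead of one global winner.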
9️⃣ Staff-Level Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| Short windows | Very fresh | Noisy and unstable |
| Long windows | Stable | Slower to detect |
| Strong spam filters | Better quality | Risk of false positives |
| Regional trends | More relevant | More aggregation complexity |
| Personalized trends | Higher relevance | Harder to explain |
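Recap Section
Quick Notes
In One Sentence
Twitter Trending is not about finding the most popular topics; it finds topics growing abnormally fast relative to their historical baseline, then filters out spam and manipulation.
Key Points to Memorize
- trending is realtime anomaly detection
- first extract hashtags, entities, and phrases from the tweet stream
- sliding windows measure recent velocity
- baseline comparison decides whether growth is anomalous
- spam/bot filtering prevents artificially manufactured trends
Recap Interview Answer
I would design Twitter Trending as a streaming anomaly detection system. The system extracts hashtags, entities, phrases, and URLs from the tweet stream, then counts them over sliding windows such as 1 minute, 5 minutes, 15 minutes, and 1 hour. But raw count is not enough, because a topic with sustained high volume may be popular without currently trending.
So the system compares the current growth rate against historical baselines, such as the same topic's past activity, the regional baseline, the time-of-day baseline, and the language baseline. It then filters out bots, repeated text, low-reputation accounts, anomalous retweet clusters, and coordinated manipulation.
The staff-level point is that trending aims to surface fresh, meaningful topics with anomalous growth. The core trade-off is among freshness, stability, and abuse resistance. Windows that are too short are fresh but noisy, and filters that are too strict may suppress genuine trends.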
✅ Final Interview Answer
A Twitter-like trending topics system ingests tweet events, extracts hashtags and entities, counts them over sliding windows, and compares current growth against historical baselines. The goal is to find unusual growth, not just high absolute volume. After candidate trends are detected, the system applies spam, bot, duplicate, language, and regional filters before ranking final trends.
At staff level, the key trade-off is freshness versus stability and abuse resistance. Very short windows detect trends quickly but are noisy. Strong filters improve quality but may suppress legitimate trends. A good system combines realtime stream processing with baseline models, regionalization, and manipulation defenses.