🎯 Design Real-time Analytics
1️⃣ Core Framework
When discussing Real-time Analytics design, I frame it as:
- Event ingestion
- Durable event stream
- Stream processing
- Windowed aggregation
- Real-time storage / OLAP serving
- Dashboard and query layer
- Late events, deduplication, and correctness
- Trade-offs: latency vs accuracy vs cost
2️⃣ Core Requirements
Functional Requirements
- Ingest events in real time
- Support metrics like counts, sums, averages, percentiles
- Support time-window aggregation
- Support dashboards
- Support filters and group-by
- Support alerts
- Support drill-down queries
- Support replay / backfill
Non-functional Requirements
- Low ingestion latency
- High write throughput
- Low dashboard query latency
- High availability
- Scalable stream processing
- Durable raw event storage
- Eventually consistent analytics acceptable
- Data freshness should be seconds to minutes
👉 Interview Answer
Real-time analytics collects events, processes them continuously, aggregates them over time windows, and serves low-latency dashboards and alerts.
The main challenge is balancing freshness, accuracy, throughput, query latency, and cost.
3️⃣ Main APIs
Ingest Event
POST /api/events
Request:
{
"eventId": "evt_123",
"eventName": "purchase_completed",
"userId": "u456",
"timestamp": "2026-05-03T10:00:00Z",
"properties": {
"amount": 49.99,
"region": "NA",
"device": "mobile"
}
}
Query Metric
GET /api/analytics/query?metric=purchase_count&from=...&to=...&groupBy=region
Create Alert
POST /api/alerts
Request:
{
"metric": "error_rate",
"condition": "> 5%",
"window": "5m",
"groupBy": ["service"]
}
👉 Interview Answer
I would expose APIs for event ingestion, metric querying, and alert rule creation.
Ingestion should be optimized for high throughput, while query APIs should read from pre-aggregated real-time stores.
4️⃣ High-Level Architecture
Client / Service Events
→ Event Collector
→ Kafka / Pulsar / Kinesis
→ Stream Processor
→ Real-time Aggregation Store
→ Query Service
→ Dashboard / Alerting
Raw Events
→ Object Storage
→ Batch Backfill / Replay
Main Components
Event Collector
- Validates events
- Adds server receive time
- Performs light enrichment
- Writes to durable stream
Durable Stream
- Buffers events
- Supports replay
- Handles backpressure
- Decouples producers and processors
Stream Processor
- Deduplicates
- Handles late events
- Performs window aggregation
- Writes aggregates
Real-time Store
- Serves dashboard queries
- Supports time-range and group-by queries
👉 Interview Answer
I would design real-time analytics as a streaming pipeline.
Events are written to a durable log, stream processors continuously aggregate them, and dashboards query a real-time OLAP store.
Raw events should also be stored for replay and backfill.
5️⃣ Event Ingestion
Ingestion Flow
Event received
→ Validate schema
→ Add received_at
→ Enrich with metadata
→ Write to durable stream
→ Acknowledge producer
Why Durable Stream?
A durable stream like Kafka gives:
- High throughput
- Replayability
- Backpressure handling
- Fault tolerance
- Consumer isolation
Partitioning
Common partition keys:
eventName
customerId
userId
tenantId
metricKey
👉 Interview Answer
The collector should do lightweight work only: validation, enrichment, and writing to a durable stream.
Once the event is durably written, the system can acknowledge the producer.
Heavy computation should happen downstream.
6️⃣ Stream Processing
Responsibilities
- Parse events
- Deduplicate events
- Filter invalid events
- Join reference data
- Aggregate metrics
- Handle late events
- Emit alerts
- Write aggregate results
Tools
Examples:
Flink
Spark Structured Streaming
Kafka Streams
Beam
Storm
Why Stream Processing?
Because dashboards and alerts need continuously updated results.
👉 Interview Answer
Stream processors consume events from the durable log, perform transformations and aggregations, and write results to real-time analytics storage.
This allows dashboards and alerts to update continuously instead of waiting for batch jobs.
7️⃣ Windowed Aggregation
Why Windows?
Analytics usually asks:
How many events happened in the last 5 minutes?
Window Types
Tumbling Window
Fixed, non-overlapping windows.
10:00–10:05
10:05–10:10
Sliding Window
Overlapping windows.
last 5 minutes, updated every 1 minute
Session Window
Grouped by activity gaps.
new session after 30 minutes inactivity
Common Aggregations
- count
- sum
- average
- min / max
- unique users
- percentiles
- error rate
- conversion rate
👉 Interview Answer
Windowed aggregation is central to real-time analytics.
Tumbling windows are simple, sliding windows are useful for rolling metrics, and session windows are useful for user behavior analysis.
The system should choose window type based on query needs.
8️⃣ Late Events and Watermarks
Problem
Events may arrive late.
Example:
event_time = 10:00
received_at = 10:05
Causes
- Mobile offline
- Network delay
- Client retry
- Clock skew
- Backpressure
Watermark
A watermark tells the processor:
We probably have received most events before time T.
Allowed Lateness
Example:
accept events up to 5 minutes late
👉 Interview Answer
Real-time analytics must handle late events.
I would store both event time and processing time, use watermarks, and allow a bounded lateness window.
Very late events can be sent to a correction or backfill pipeline.
9️⃣ Deduplication and Exactly-once Effect
Why Duplicates Happen
- Client retries
- Collector retries
- Stream processor restarts
- At-least-once delivery
- Network timeouts
Dedup Key
eventId
or:
source + eventId
Strategies
- Dedup cache
- Idempotent writes
- Exactly-once stream processing where supported
- Upsert aggregates
- Store processed event IDs for critical events
👉 Interview Answer
In practice, exactly-once delivery is hard.
I would aim for exactly-once effect by using event IDs, deduplication, idempotent writes, and transactional stream processing where available.
For most dashboards, small eventual corrections are acceptable.
🔟 Real-time Storage / OLAP Store
Requirements
- Fast time-range queries
- Fast group-by queries
- High write throughput
- Columnar compression
- Rollups and aggregations
- Low-latency dashboard serving
Options
Apache Pinot
Apache Druid
ClickHouse
Elasticsearch
TimescaleDB
BigQuery BI Engine
Data Model Example
analytics_aggregate (
metric_name VARCHAR,
window_start TIMESTAMP,
window_size VARCHAR,
dimensions JSON,
value DOUBLE,
updated_at TIMESTAMP
)
👉 Interview Answer
Real-time analytics needs a serving store optimized for OLAP queries.
Systems like Pinot, Druid, or ClickHouse work well because they support fast time-range queries, group-by aggregations, and high write throughput.
1️⃣1️⃣ Raw Events and Backfill
Why Keep Raw Events?
- Reprocess after bug fix
- Add new metrics
- Backfill dashboards
- Audit data
- Train models
- Correct late data
Storage
Object storage partitioned by date/hour/event type
Example:
s3://events/date=2026-05-03/hour=10/event=purchase/
Replay Flow
Raw events
→ Batch job or replay stream
→ Recompute aggregates
→ Update analytics store
👉 Interview Answer
I would always store raw events durably.
Real-time pipelines may have bugs, so raw events allow replay, backfill, metric correction, and new analytics creation.
1️⃣2️⃣ Query and Dashboard Layer
Dashboard Query Flow
User opens dashboard
→ Query service validates access
→ Reads pre-aggregated data
→ Applies filters/group-by
→ Returns chart data
Query Optimization
- Pre-aggregate common metrics
- Cache dashboard results
- Limit high-cardinality group-bys
- Use rollups for long time ranges
- Use approximate algorithms when acceptable
Example Queries
Orders per minute by region
Error rate by service over last 15 minutes
Active users in last 5 minutes
P95 latency by endpoint
👉 Interview Answer
Dashboards should query pre-aggregated or OLAP-optimized data, not raw events.
To keep dashboards fast, I would cache common queries, use rollups, and limit high-cardinality group-bys.
1️⃣3️⃣ Alerts
Alert Flow
Stream processor computes metric
→ Rule evaluator checks condition
→ Alert fires
→ Dedup / suppress
→ Notification system
Alert Types
- Threshold alert
- Anomaly alert
- Missing data alert
- Rate-of-change alert
- Burn-rate alert
Alert Reliability
- Avoid duplicate alerts
- Use cooldown windows
- Group alerts
- Track alert state
- Support silence / maintenance windows
👉 Interview Answer
Real-time analytics often powers alerting.
The alerting system should evaluate metrics over windows, suppress duplicates, support cooldowns, and route notifications to the right team.
1️⃣4️⃣ High-cardinality Dimensions
Problem
Group-by dimensions can explode.
Examples:
userId
requestId
sessionId
orderId
ipAddress
Impact
- High memory usage
- Slow queries
- Expensive storage
- Poor cache hit rate
Strategies
- Limit allowed dimensions
- Pre-approve schemas
- Use approximate aggregations
- Sample low-value events
- Move high-cardinality drill-down to raw logs
- Apply per-tenant quotas
👉 Interview Answer
High-cardinality dimensions are dangerous in real-time analytics.
I would limit which fields can be used as dimensions, enforce schema controls, and use approximations or sampling when necessary.
Request-level details usually belong in logs, not metrics dashboards.
1️⃣5️⃣ Accuracy vs Latency
Low Latency
Pros:
- Fresh dashboards
- Fast alerting
Cons:
- More incomplete data
- More late-event corrections
- Higher compute cost
Higher Accuracy
Pros:
- Better final reports
- Fewer corrections
Cons:
- Slower dashboards
- More delayed alerts
Common Strategy
Real-time approximate view now
Batch-corrected accurate view later
👉 Interview Answer
Real-time analytics often trades accuracy for freshness.
I would provide a fast near-real-time view and use batch processing later to correct late events and produce accurate historical reports.
1️⃣6️⃣ Scaling Patterns
Pattern 1: Durable Event Log
Use Kafka/Pulsar/Kinesis to absorb spikes.
Pattern 2: Partitioned Stream Processing
Partition by tenant, metric, or entity.
Pattern 3: Pre-aggregation
Compute common metrics before serving queries.
Pattern 4: Rollups
Store multiple resolutions:
1-minute, 5-minute, 1-hour
Pattern 5: Hot / Cold Path
Hot path: stream processing for real-time dashboards
Cold path: batch processing for accuracy and backfill
👉 Interview Answer
To scale real-time analytics, I would use a durable event log, partition stream processing, pre-aggregate common metrics, store rollups, and separate hot real-time processing from cold batch correction.
1️⃣7️⃣ Failure Handling
Common Failures
- Collector overload
- Kafka lag
- Stream processor crash
- Duplicate events
- Late events
- OLAP store unavailable
- Dashboard query timeout
- Bad schema deployment
Strategies
- Backpressure
- Retry with exponential backoff
- Dead-letter queue
- Replay from Kafka
- Rebuild from raw events
- Fallback to cached dashboard
- Schema rollback
- Data quality alerts
👉 Interview Answer
The system should be designed for replayability.
If stream processing fails, we can replay from Kafka.
If aggregates are wrong, we can rebuild from raw events.
If dashboards fail, we can serve cached or degraded data.
1️⃣8️⃣ Consistency Model
Stronger Consistency Needed For
- Access control
- Billing-related analytics
- Experiment assignment
- Critical alert rules
- Data deletion / privacy requests
Eventual Consistency Acceptable For
- Dashboards
- Aggregated metrics
- Funnel reports
- Retention reports
- Trend analysis
- Most alerts with small delay tolerance
👉 Interview Answer
Real-time analytics is usually eventually consistent.
A metric being corrected a few minutes later is acceptable for many dashboards.
But access control, billing-related metrics, experiment assignment, and privacy deletion requests need stronger correctness.
1️⃣9️⃣ Observability
Key Metrics
- Event ingestion QPS
- Collector latency
- Kafka lag
- Stream processing lag
- Watermark delay
- Late event rate
- Duplicate event rate
- OLAP write latency
- Dashboard query latency
- Data freshness
- Alert evaluation delay
- DLQ event count
👉 Interview Answer
I would monitor ingestion QPS, queue lag, stream processing lag, watermark delay, late event rate, duplicate event rate, OLAP write latency, dashboard query latency, data freshness, and alert delay.
These metrics show whether real-time analytics is fresh, correct, and reliable.
2️⃣0️⃣ End-to-End Flow
Real-time Metric Flow
Event produced
→ Collector validates event
→ Kafka stores event
→ Stream processor aggregates by window
→ OLAP store receives aggregate
→ Dashboard queries latest metric
Backfill Flow
Raw events stored in object storage
→ Batch job reprocesses data
→ Correct aggregates
→ Update historical analytics table
Alert Flow
Stream processor computes metric
→ Alert rule evaluates condition
→ Alert fires
→ Notification sent
→ Alert state tracked
Key Insight
Real-time Analytics is not just querying fresh data — it is a streaming aggregation system with replay, correction, and serving layers.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a real-time analytics system, I think of it as a streaming aggregation and serving platform.
Events are collected from clients and backend services, validated by collectors, and written to a durable event log such as Kafka.
Stream processors consume events continuously, deduplicate them, handle late arrivals with watermarks, aggregate metrics over time windows, and write results into a real-time OLAP store.
Dashboards and alerting systems query the OLAP store, not raw events, because dashboards need low latency.
I would keep raw events in object storage so the system can replay, backfill, and correct metrics if the pipeline changes or bugs are found.
Windowed aggregation is central. Tumbling windows are useful for fixed intervals, sliding windows for rolling metrics, and session windows for user behavior.
High-cardinality dimensions must be controlled, because they can make storage and queries very expensive.
Real-time analytics usually accepts eventual consistency. A dashboard may be slightly delayed, and late events can be corrected later.
But access control, privacy deletion, experiment assignment, and billing-related metrics require stronger correctness.
The main trade-offs are freshness, accuracy, query latency, compute cost, storage cost, and operational complexity.
Ultimately, the goal is to provide fresh, reliable, and low-latency metrics for dashboards and alerts, while preserving raw data for replay and correction.
⭐ Final Insight
Real-time Analytics 的核心不是“查最新数据”, 而是一个由 event stream、window aggregation、OLAP serving、replay/backfill 组成的低延迟分析系统。
中文部分
🎯 Design Real-time Analytics
1️⃣ 核心框架
在设计 Real-time Analytics 时,我通常从以下几个方面分析:
- Event ingestion
- Durable event stream
- Stream processing
- Windowed aggregation
- Real-time storage / OLAP serving
- Dashboard and query layer
- Late events、deduplication 和 correctness
- 核心权衡:latency vs accuracy vs cost
2️⃣ 核心需求
功能需求
- 实时 ingest events
- 支持 counts、sums、averages、percentiles
- 支持 time-window aggregation
- 支持 dashboards
- 支持 filters 和 group-by
- 支持 alerts
- 支持 drill-down queries
- 支持 replay / backfill
非功能需求
- 低 ingestion 延迟
- 高写入吞吐
- 低 dashboard query latency
- 高可用
- 可扩展 stream processing
- Raw event 持久化存储
- Analytics 可以最终一致
- Data freshness 应该在秒到分钟级
👉 面试回答
Real-time Analytics 会收集 events, 持续处理这些 events, 按时间窗口进行聚合, 并服务低延迟 dashboards 和 alerts。
核心挑战是在 freshness、accuracy、 throughput、query latency 和 cost 之间做平衡。
3️⃣ Main APIs
Ingest Event
POST /api/events
Request:
{
"eventId": "evt_123",
"eventName": "purchase_completed",
"userId": "u456",
"timestamp": "2026-05-03T10:00:00Z",
"properties": {
"amount": 49.99,
"region": "NA",
"device": "mobile"
}
}
Query Metric
GET /api/analytics/query?metric=purchase_count&from=...&to=...&groupBy=region
Create Alert
POST /api/alerts
Request:
{
"metric": "error_rate",
"condition": "> 5%",
"window": "5m",
"groupBy": ["service"]
}
👉 面试回答
我会提供 event ingestion、metric querying 和 alert rule creation APIs。
Ingestion 需要为高吞吐优化, query APIs 应该从预聚合的 real-time stores 读取。
4️⃣ High-Level Architecture
Client / Service Events
→ Event Collector
→ Kafka / Pulsar / Kinesis
→ Stream Processor
→ Real-time Aggregation Store
→ Query Service
→ Dashboard / Alerting
Raw Events
→ Object Storage
→ Batch Backfill / Replay
Main Components
Event Collector
- Validate events
- Add server receive time
- Light enrichment
- Write to durable stream
Durable Stream
- Buffer events
- Support replay
- Handle backpressure
- Decouple producers and processors
Stream Processor
- Deduplicate
- Handle late events
- Perform window aggregation
- Write aggregates
Real-time Store
- Serve dashboard queries
- Support time-range and group-by queries
👉 面试回答
我会将 real-time analytics 设计成 streaming pipeline。
Events 写入 durable log, stream processors 持续聚合 events, dashboards 查询 real-time OLAP store。
Raw events 也应该被保存, 用于 replay 和 backfill。
5️⃣ Event Ingestion
Ingestion Flow
Event received
→ Validate schema
→ Add received_at
→ Enrich with metadata
→ Write to durable stream
→ Acknowledge producer
Why Durable Stream?
Kafka 这类 durable stream 提供:
- High throughput
- Replayability
- Backpressure handling
- Fault tolerance
- Consumer isolation
Partitioning
常见 partition keys:
eventName
customerId
userId
tenantId
metricKey
👉 面试回答
Collector 应该只做轻量工作: validation、enrichment, 然后写入 durable stream。
事件持久写入后, 系统就可以 acknowledge producer。
重计算应该放到 downstream。
6️⃣ Stream Processing
Responsibilities
- Parse events
- Deduplicate events
- Filter invalid events
- Join reference data
- Aggregate metrics
- Handle late events
- Emit alerts
- Write aggregate results
Tools
例如:
Flink
Spark Structured Streaming
Kafka Streams
Beam
Storm
Why Stream Processing?
因为 dashboards 和 alerts 需要持续更新的结果。
👉 面试回答
Stream processors 从 durable log 消费 events, 执行 transformation 和 aggregation, 并将结果写入 real-time analytics storage。
这样 dashboards 和 alerts 可以持续更新, 不需要等待 batch jobs。
7️⃣ Windowed Aggregation
为什么需要 Windows?
Analytics 通常会问:
How many events happened in the last 5 minutes?
Window Types
Tumbling Window
固定、不重叠窗口。
10:00–10:05
10:05–10:10
Sliding Window
重叠窗口。
last 5 minutes, updated every 1 minute
Session Window
按活动间隔分组。
new session after 30 minutes inactivity
Common Aggregations
- count
- sum
- average
- min / max
- unique users
- percentiles
- error rate
- conversion rate
👉 面试回答
Windowed aggregation 是 real-time analytics 的核心。
Tumbling windows 简单, sliding windows 适合 rolling metrics, session windows 适合 user behavior analysis。
系统应该根据 query needs 选择 window type。
8️⃣ Late Events and Watermarks
Problem
Events 可能迟到。
示例:
event_time = 10:00
received_at = 10:05
Causes
- Mobile offline
- Network delay
- Client retry
- Clock skew
- Backpressure
Watermark
Watermark 告诉 processor:
We probably have received most events before time T.
Allowed Lateness
示例:
accept events up to 5 minutes late
👉 面试回答
Real-time analytics 必须处理 late events。
我会同时保存 event time 和 processing time, 使用 watermarks, 并允许一个有限的 lateness window。
非常迟到的 events 可以进入 correction 或 backfill pipeline。
9️⃣ Deduplication and Exactly-once Effect
为什么会有 Duplicates?
- Client retries
- Collector retries
- Stream processor restarts
- At-least-once delivery
- Network timeouts
Dedup Key
eventId
或者:
source + eventId
Strategies
- Dedup cache
- Idempotent writes
- 支持时使用 exactly-once stream processing
- Upsert aggregates
- Critical events 存储 processed event IDs
👉 面试回答
实际上 exactly-once delivery 很难。
我会追求 exactly-once effect, 使用 event IDs、deduplication、idempotent writes, 以及可用时的 transactional stream processing。
对大多数 dashboards, 少量 eventual corrections 是可以接受的。
🔟 Real-time Storage / OLAP Store
Requirements
- 快速 time-range queries
- 快速 group-by queries
- 高写入吞吐
- Columnar compression
- Rollups and aggregations
- 低延迟 dashboard serving
Options
Apache Pinot
Apache Druid
ClickHouse
Elasticsearch
TimescaleDB
BigQuery BI Engine
Data Model Example
analytics_aggregate (
metric_name VARCHAR,
window_start TIMESTAMP,
window_size VARCHAR,
dimensions JSON,
value DOUBLE,
updated_at TIMESTAMP
)
👉 面试回答
Real-time analytics 需要一个为 OLAP queries 优化的 serving store。
Pinot、Druid 或 ClickHouse 这类系统适合, 因为它们支持快速 time-range queries、 group-by aggregations 和高写入吞吐。
1️⃣1️⃣ Raw Events and Backfill
为什么保留 Raw Events?
- Reprocess after bug fix
- Add new metrics
- Backfill dashboards
- Audit data
- Train models
- Correct late data
Storage
Object storage partitioned by date/hour/event type
示例:
s3://events/date=2026-05-03/hour=10/event=purchase/
Replay Flow
Raw events
→ Batch job or replay stream
→ Recompute aggregates
→ Update analytics store
👉 面试回答
我会始终持久保存 raw events。
Real-time pipelines 可能有 bug, 所以 raw events 支持 replay、backfill、 metric correction 和创建新 analytics。
1️⃣2️⃣ Query and Dashboard Layer
Dashboard Query Flow
User opens dashboard
→ Query service validates access
→ Reads pre-aggregated data
→ Applies filters/group-by
→ Returns chart data
Query Optimization
- Pre-aggregate common metrics
- Cache dashboard results
- 限制 high-cardinality group-bys
- 长时间范围使用 rollups
- 可接受时使用 approximate algorithms
Example Queries
Orders per minute by region
Error rate by service over last 15 minutes
Active users in last 5 minutes
P95 latency by endpoint
👉 面试回答
Dashboards 应该查询 pre-aggregated 或 OLAP-optimized data, 而不是 raw events。
为了保证 dashboard 快, 我会缓存常见 queries, 使用 rollups, 并限制 high-cardinality group-bys。
1️⃣3️⃣ Alerts
Alert Flow
Stream processor computes metric
→ Rule evaluator checks condition
→ Alert fires
→ Dedup / suppress
→ Notification system
Alert Types
- Threshold alert
- Anomaly alert
- Missing data alert
- Rate-of-change alert
- Burn-rate alert
Alert Reliability
- 避免 duplicate alerts
- 使用 cooldown windows
- Group alerts
- Track alert state
- 支持 silence / maintenance windows
👉 面试回答
Real-time analytics 经常用于 alerting。
Alerting system 应该基于 windows 评估 metrics, 抑制重复 alerts, 支持 cooldown, 并将通知 route 给正确 team。
1️⃣4️⃣ High-cardinality Dimensions
Problem
Group-by dimensions 可能爆炸。
示例:
userId
requestId
sessionId
orderId
ipAddress
Impact
- High memory usage
- Slow queries
- Expensive storage
- Poor cache hit rate
Strategies
- 限制 allowed dimensions
- 预先审批 schemas
- 使用 approximate aggregations
- 对低价值 events 采样
- High-cardinality drill-down 交给 raw logs
- Apply per-tenant quotas
👉 面试回答
High-cardinality dimensions 对 real-time analytics 很危险。
我会限制哪些字段可以作为 dimensions, 强制 schema controls, 必要时使用 approximations 或 sampling。
Request-level details 通常应该进入 logs, 而不是 metrics dashboards。
1️⃣5️⃣ Accuracy vs Latency
Low Latency
优点:
- Fresh dashboards
- Fast alerting
缺点:
- 数据更不完整
- 需要更多 late-event corrections
- Compute cost 更高
Higher Accuracy
优点:
- Final reports 更准确
- Corrections 更少
缺点:
- Dashboards 更慢
- Alerts 更延迟
Common Strategy
Real-time approximate view now
Batch-corrected accurate view later
👉 面试回答
Real-time analytics 经常用 accuracy 换 freshness。
我会提供快速 near-real-time view, 然后用 batch processing 后续修正 late events, 生成准确历史 reports。
1️⃣6️⃣ Scaling Patterns
Pattern 1: Durable Event Log
使用 Kafka / Pulsar / Kinesis 吸收流量峰值。
Pattern 2: Partitioned Stream Processing
按 tenant、metric 或 entity 分区。
Pattern 3: Pre-aggregation
Serving query 前预计算常见 metrics。
Pattern 4: Rollups
存储多种分辨率:
1-minute, 5-minute, 1-hour
Pattern 5: Hot / Cold Path
Hot path: stream processing for real-time dashboards
Cold path: batch processing for accuracy and backfill
👉 面试回答
为了扩展 real-time analytics, 我会使用 durable event log, 对 stream processing 分区, 预聚合常见 metrics, 存储 rollups, 并将 hot real-time processing 和 cold batch correction 分开。
1️⃣7️⃣ Failure Handling
Common Failures
- Collector overload
- Kafka lag
- Stream processor crash
- Duplicate events
- Late events
- OLAP store unavailable
- Dashboard query timeout
- Bad schema deployment
Strategies
- Backpressure
- Retry with exponential backoff
- Dead-letter queue
- Replay from Kafka
- Rebuild from raw events
- Fallback to cached dashboard
- Schema rollback
- Data quality alerts
👉 面试回答
系统应该为 replayability 设计。
如果 stream processing 失败, 可以从 Kafka replay。
如果 aggregates 错了, 可以从 raw events 重建。
如果 dashboards 失败, 可以返回 cached 或 degraded data。
1️⃣8️⃣ Consistency Model
需要较强一致性的场景
- Access control
- Billing-related analytics
- Experiment assignment
- Critical alert rules
- Data deletion / privacy requests
可以最终一致的场景
- Dashboards
- Aggregated metrics
- Funnel reports
- Retention reports
- Trend analysis
- Most alerts with small delay tolerance
👉 面试回答
Real-time analytics 通常是最终一致的。
指标几分钟后被 correction, 对很多 dashboards 来说是可以接受的。
但 access control、billing-related metrics、 experiment assignment 和 privacy deletion requests 需要更强正确性。
1️⃣9️⃣ Observability
Key Metrics
- Event ingestion QPS
- Collector latency
- Kafka lag
- Stream processing lag
- Watermark delay
- Late event rate
- Duplicate event rate
- OLAP write latency
- Dashboard query latency
- Data freshness
- Alert evaluation delay
- DLQ event count
👉 面试回答
我会监控 ingestion QPS、queue lag、 stream processing lag、watermark delay、 late event rate、duplicate event rate、 OLAP write latency、dashboard query latency、 data freshness 和 alert delay。
这些指标可以说明 real-time analytics 是否 fresh、correct 和 reliable。
2️⃣0️⃣ End-to-End Flow
Real-time Metric Flow
Event produced
→ Collector validates event
→ Kafka stores event
→ Stream processor aggregates by window
→ OLAP store receives aggregate
→ Dashboard queries latest metric
Backfill Flow
Raw events stored in object storage
→ Batch job reprocesses data
→ Correct aggregates
→ Update historical analytics table
Alert Flow
Stream processor computes metric
→ Alert rule evaluates condition
→ Alert fires
→ Notification sent
→ Alert state tracked
Key Insight
Real-time Analytics 不是简单查询新数据, 而是 streaming aggregation、replay、correction 和 serving layers 组成的系统。
🧠 Staff-Level Answer(最终版)
👉 面试回答(完整背诵版)
在设计 Real-time Analytics System 时, 我会把它看作一个 streaming aggregation 和 serving platform。
Events 从 clients 和 backend services 收集, 由 collectors 进行校验, 然后写入 Kafka 这类 durable event log。
Stream processors 会持续消费 events, 去重, 使用 watermarks 处理 late arrivals, 按 time windows 聚合 metrics, 并将结果写入 real-time OLAP store。
Dashboards 和 alerting systems 查询 OLAP store, 而不是 raw events, 因为 dashboards 需要低延迟。
我会将 raw events 保存在 object storage 中, 这样当 pipeline 变化或发现 bug 时, 系统可以 replay、backfill 和修正 metrics。
Windowed aggregation 是核心。 Tumbling windows 适合固定时间区间, sliding windows 适合 rolling metrics, session windows 适合 user behavior。
High-cardinality dimensions 必须被控制, 因为它们会让 storage 和 queries 变得非常昂贵。
Real-time analytics 通常可以最终一致。 Dashboard 可以轻微延迟, late events 可以之后修正。
但 access control、privacy deletion、 experiment assignment 和 billing-related metrics 需要更强正确性。
核心权衡包括 freshness、accuracy、query latency、 compute cost、storage cost 和 operational complexity。
最终目标是为 dashboards 和 alerts 提供 fresh、reliable、low-latency metrics, 同时保留 raw data 用于 replay 和 correction。
⭐ Final Insight
Real-time Analytics 的核心不是“查最新数据”, 而是一个由 event stream、window aggregation、OLAP serving、replay/backfill 组成的低延迟分析系统。
Implement