d&d-t System Design Deep Dive ·

🎯 Design Real-time Analytics

1️⃣ Core Framework

When discussing Real-time Analytics design, I frame it as:

Event ingestion
Durable event stream
Stream processing
Windowed aggregation
Real-time storage / OLAP serving
Dashboard and query layer
Late events, deduplication, and correctness
Trade-offs: latency vs accuracy vs cost

2️⃣ Core Requirements

Functional Requirements

Ingest events in real time
Support metrics like counts, sums, averages, percentiles
Support time-window aggregation
Support dashboards
Support filters and group-by
Support alerts
Support drill-down queries
Support replay / backfill

Non-functional Requirements

Low ingestion latency
High write throughput
Low dashboard query latency
High availability
Scalable stream processing
Durable raw event storage
Eventually consistent analytics acceptable
Data freshness should be seconds to minutes

👉 Interview Answer

Real-time analytics collects events, processes them continuously, aggregates them over time windows, and serves low-latency dashboards and alerts.

The main challenge is balancing freshness, accuracy, throughput, query latency, and cost.

3️⃣ Main APIs

Ingest Event

POST /api/events

Request:

{
  "eventId": "evt_123",
  "eventName": "purchase_completed",
  "userId": "u456",
  "timestamp": "2026-05-03T10:00:00Z",
  "properties": {
    "amount": 49.99,
    "region": "NA",
    "device": "mobile"
  }
}

Query Metric

GET /api/analytics/query?metric=purchase_count&from=...&to=...&groupBy=region

Create Alert

POST /api/alerts

Request:

{
  "metric": "error_rate",
  "condition": "> 5%",
  "window": "5m",
  "groupBy": ["service"]
}

👉 Interview Answer

I would expose APIs for event ingestion, metric querying, and alert rule creation.

Ingestion should be optimized for high throughput, while query APIs should read from pre-aggregated real-time stores.

4️⃣ High-Level Architecture

Client / Service Events
→ Event Collector
→ Kafka / Pulsar / Kinesis
→ Stream Processor
→ Real-time Aggregation Store
→ Query Service
→ Dashboard / Alerting

Raw Events
→ Object Storage
→ Batch Backfill / Replay

Main Components

Event Collector

Validates events
Adds server receive time
Performs light enrichment
Writes to durable stream

Durable Stream

Buffers events
Supports replay
Handles backpressure
Decouples producers and processors

Stream Processor

Deduplicates
Handles late events
Performs window aggregation
Writes aggregates

Real-time Store

Serves dashboard queries
Supports time-range and group-by queries

👉 Interview Answer

I would design real-time analytics as a streaming pipeline.

Events are written to a durable log, stream processors continuously aggregate them, and dashboards query a real-time OLAP store.

Raw events should also be stored for replay and backfill.

5️⃣ Event Ingestion

Ingestion Flow

Event received
→ Validate schema
→ Add received_at
→ Enrich with metadata
→ Write to durable stream
→ Acknowledge producer

Why Durable Stream?

A durable stream like Kafka gives:

High throughput
Replayability
Backpressure handling
Fault tolerance
Consumer isolation

Partitioning

Common partition keys:

eventName
customerId
userId
tenantId
metricKey

👉 Interview Answer

The collector should do lightweight work only: validation, enrichment, and writing to a durable stream.

Once the event is durably written, the system can acknowledge the producer.

Heavy computation should happen downstream.

6️⃣ Stream Processing

Responsibilities

Parse events
Deduplicate events
Filter invalid events
Join reference data
Aggregate metrics
Handle late events
Emit alerts
Write aggregate results

Tools

Examples:

Flink
Spark Structured Streaming
Kafka Streams
Beam
Storm

Why Stream Processing?

Because dashboards and alerts need continuously updated results.

👉 Interview Answer

Stream processors consume events from the durable log, perform transformations and aggregations, and write results to real-time analytics storage.

This allows dashboards and alerts to update continuously instead of waiting for batch jobs.

7️⃣ Windowed Aggregation

Why Windows?

Analytics usually asks:

How many events happened in the last 5 minutes?

Window Types

Tumbling Window

Fixed, non-overlapping windows.

10:00–10:05
10:05–10:10

Sliding Window

Overlapping windows.

last 5 minutes, updated every 1 minute

Session Window

Grouped by activity gaps.

new session after 30 minutes inactivity

Common Aggregations

count
sum
average
min / max
unique users
percentiles
error rate
conversion rate

👉 Interview Answer

Windowed aggregation is central to real-time analytics.

Tumbling windows are simple, sliding windows are useful for rolling metrics, and session windows are useful for user behavior analysis.

The system should choose window type based on query needs.

8️⃣ Late Events and Watermarks

Problem

Events may arrive late.

Example:

event_time = 10:00
received_at = 10:05

Causes

Mobile offline
Network delay
Client retry
Clock skew
Backpressure

Watermark

A watermark tells the processor:

We probably have received most events before time T.

Allowed Lateness

Example:

accept events up to 5 minutes late

👉 Interview Answer

Real-time analytics must handle late events.

I would store both event time and processing time, use watermarks, and allow a bounded lateness window.

Very late events can be sent to a correction or backfill pipeline.

9️⃣ Deduplication and Exactly-once Effect

Why Duplicates Happen

Client retries
Collector retries
Stream processor restarts
At-least-once delivery
Network timeouts

Dedup Key

eventId

or:

source + eventId

Strategies

Dedup cache
Idempotent writes
Exactly-once stream processing where supported
Upsert aggregates
Store processed event IDs for critical events

👉 Interview Answer

In practice, exactly-once delivery is hard.

I would aim for exactly-once effect by using event IDs, deduplication, idempotent writes, and transactional stream processing where available.

For most dashboards, small eventual corrections are acceptable.

🔟 Real-time Storage / OLAP Store

Requirements

Fast time-range queries
Fast group-by queries
High write throughput
Columnar compression
Rollups and aggregations
Low-latency dashboard serving

Options

Apache Pinot
Apache Druid
ClickHouse
Elasticsearch
TimescaleDB
BigQuery BI Engine

Data Model Example

analytics_aggregate (
  metric_name VARCHAR,
  window_start TIMESTAMP,
  window_size VARCHAR,
  dimensions JSON,
  value DOUBLE,
  updated_at TIMESTAMP
)

👉 Interview Answer

Real-time analytics needs a serving store optimized for OLAP queries.

Systems like Pinot, Druid, or ClickHouse work well because they support fast time-range queries, group-by aggregations, and high write throughput.

1️⃣1️⃣ Raw Events and Backfill

Why Keep Raw Events?

Reprocess after bug fix
Add new metrics
Backfill dashboards
Audit data
Train models
Correct late data

Storage

Object storage partitioned by date/hour/event type

Example:

s3://events/date=2026-05-03/hour=10/event=purchase/

Replay Flow

Raw events
→ Batch job or replay stream
→ Recompute aggregates
→ Update analytics store

👉 Interview Answer

I would always store raw events durably.

Real-time pipelines may have bugs, so raw events allow replay, backfill, metric correction, and new analytics creation.

1️⃣2️⃣ Query and Dashboard Layer

Dashboard Query Flow

User opens dashboard
→ Query service validates access
→ Reads pre-aggregated data
→ Applies filters/group-by
→ Returns chart data

Query Optimization

Pre-aggregate common metrics
Cache dashboard results
Limit high-cardinality group-bys
Use rollups for long time ranges
Use approximate algorithms when acceptable

Example Queries

Orders per minute by region
Error rate by service over last 15 minutes
Active users in last 5 minutes
P95 latency by endpoint

👉 Interview Answer

Dashboards should query pre-aggregated or OLAP-optimized data, not raw events.

To keep dashboards fast, I would cache common queries, use rollups, and limit high-cardinality group-bys.

1️⃣3️⃣ Alerts

Alert Flow

Stream processor computes metric
→ Rule evaluator checks condition
→ Alert fires
→ Dedup / suppress
→ Notification system

Alert Types

Threshold alert
Anomaly alert
Missing data alert
Rate-of-change alert
Burn-rate alert

Alert Reliability

Avoid duplicate alerts
Use cooldown windows
Group alerts
Track alert state
Support silence / maintenance windows

👉 Interview Answer

Real-time analytics often powers alerting.

The alerting system should evaluate metrics over windows, suppress duplicates, support cooldowns, and route notifications to the right team.

1️⃣4️⃣ High-cardinality Dimensions

Problem

Group-by dimensions can explode.

Examples:

userId
requestId
sessionId
orderId
ipAddress

Impact

High memory usage
Slow queries
Expensive storage
Poor cache hit rate

Strategies

Limit allowed dimensions
Pre-approve schemas
Use approximate aggregations
Sample low-value events
Move high-cardinality drill-down to raw logs
Apply per-tenant quotas

👉 Interview Answer

High-cardinality dimensions are dangerous in real-time analytics.

I would limit which fields can be used as dimensions, enforce schema controls, and use approximations or sampling when necessary.

Request-level details usually belong in logs, not metrics dashboards.

1️⃣5️⃣ Accuracy vs Latency

Low Latency

Pros:

Fresh dashboards
Fast alerting

Cons:

More incomplete data
More late-event corrections
Higher compute cost

Higher Accuracy

Pros:

Better final reports
Fewer corrections

Cons:

Slower dashboards
More delayed alerts

Common Strategy

Real-time approximate view now
Batch-corrected accurate view later

👉 Interview Answer

Real-time analytics often trades accuracy for freshness.

I would provide a fast near-real-time view and use batch processing later to correct late events and produce accurate historical reports.

1️⃣6️⃣ Scaling Patterns

Pattern 1: Durable Event Log

Use Kafka/Pulsar/Kinesis to absorb spikes.

Pattern 2: Partitioned Stream Processing

Partition by tenant, metric, or entity.

Pattern 3: Pre-aggregation

Compute common metrics before serving queries.

Pattern 4: Rollups

Store multiple resolutions:

1-minute, 5-minute, 1-hour

Pattern 5: Hot / Cold Path

Hot path: stream processing for real-time dashboards
Cold path: batch processing for accuracy and backfill

👉 Interview Answer

To scale real-time analytics, I would use a durable event log, partition stream processing, pre-aggregate common metrics, store rollups, and separate hot real-time processing from cold batch correction.

1️⃣7️⃣ Failure Handling

Common Failures

Collector overload
Kafka lag
Stream processor crash
Duplicate events
Late events
OLAP store unavailable
Dashboard query timeout
Bad schema deployment

Strategies

Backpressure
Retry with exponential backoff
Dead-letter queue
Replay from Kafka
Rebuild from raw events
Fallback to cached dashboard
Schema rollback
Data quality alerts

👉 Interview Answer

The system should be designed for replayability.

If stream processing fails, we can replay from Kafka.

If aggregates are wrong, we can rebuild from raw events.

If dashboards fail, we can serve cached or degraded data.

1️⃣8️⃣ Consistency Model

Stronger Consistency Needed For

Access control
Billing-related analytics
Experiment assignment
Critical alert rules
Data deletion / privacy requests

Eventual Consistency Acceptable For

Dashboards
Aggregated metrics
Funnel reports
Retention reports
Trend analysis
Most alerts with small delay tolerance

👉 Interview Answer

Real-time analytics is usually eventually consistent.

A metric being corrected a few minutes later is acceptable for many dashboards.

But access control, billing-related metrics, experiment assignment, and privacy deletion requests need stronger correctness.

1️⃣9️⃣ Observability

Key Metrics

Event ingestion QPS
Collector latency
Kafka lag
Stream processing lag
Watermark delay
Late event rate
Duplicate event rate
OLAP write latency
Dashboard query latency
Data freshness
Alert evaluation delay
DLQ event count

👉 Interview Answer

I would monitor ingestion QPS, queue lag, stream processing lag, watermark delay, late event rate, duplicate event rate, OLAP write latency, dashboard query latency, data freshness, and alert delay.

These metrics show whether real-time analytics is fresh, correct, and reliable.

2️⃣0️⃣ End-to-End Flow

Real-time Metric Flow

Event produced
→ Collector validates event
→ Kafka stores event
→ Stream processor aggregates by window
→ OLAP store receives aggregate
→ Dashboard queries latest metric

Backfill Flow

Raw events stored in object storage
→ Batch job reprocesses data
→ Correct aggregates
→ Update historical analytics table

Alert Flow

Stream processor computes metric
→ Alert rule evaluates condition
→ Alert fires
→ Notification sent
→ Alert state tracked

Key Insight

Real-time Analytics is not just querying fresh data — it is a streaming aggregation system with replay, correction, and serving layers.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a real-time analytics system, I think of it as a streaming aggregation and serving platform.

Events are collected from clients and backend services, validated by collectors, and written to a durable event log such as Kafka.

Stream processors consume events continuously, deduplicate them, handle late arrivals with watermarks, aggregate metrics over time windows, and write results into a real-time OLAP store.

Dashboards and alerting systems query the OLAP store, not raw events, because dashboards need low latency.

I would keep raw events in object storage so the system can replay, backfill, and correct metrics if the pipeline changes or bugs are found.

Windowed aggregation is central. Tumbling windows are useful for fixed intervals, sliding windows for rolling metrics, and session windows for user behavior.

High-cardinality dimensions must be controlled, because they can make storage and queries very expensive.

Real-time analytics usually accepts eventual consistency. A dashboard may be slightly delayed, and late events can be corrected later.

But access control, privacy deletion, experiment assignment, and billing-related metrics require stronger correctness.

The main trade-offs are freshness, accuracy, query latency, compute cost, storage cost, and operational complexity.

Ultimately, the goal is to provide fresh, reliable, and low-latency metrics for dashboards and alerts, while preserving raw data for replay and correction.

⭐ Final Insight

Real-time Analytics 的核心不是“查最新数据”，而是一个由 event stream、window aggregation、OLAP serving、replay/backfill 组成的低延迟分析系统。

中文部分

🎯 Design Real-time Analytics

1️⃣ 核心框架

在设计 Real-time Analytics 时，我通常从以下几个方面分析：

Event ingestion
Durable event stream
Stream processing
Windowed aggregation
Real-time storage / OLAP serving
Dashboard and query layer
Late events、deduplication 和 correctness
核心权衡：latency vs accuracy vs cost

2️⃣ 核心需求

功能需求

实时 ingest events
支持 counts、sums、averages、percentiles
支持 time-window aggregation
支持 dashboards
支持 filters 和 group-by
支持 alerts
支持 drill-down queries
支持 replay / backfill

非功能需求

低 ingestion 延迟
高写入吞吐
低 dashboard query latency
高可用
可扩展 stream processing
Raw event 持久化存储
Analytics 可以最终一致
Data freshness 应该在秒到分钟级

👉 面试回答

Real-time Analytics 会收集 events，持续处理这些 events，按时间窗口进行聚合，并服务低延迟 dashboards 和 alerts。

核心挑战是在 freshness、accuracy、 throughput、query latency 和 cost 之间做平衡。

3️⃣ Main APIs

Ingest Event

POST /api/events

Request:

{
  "eventId": "evt_123",
  "eventName": "purchase_completed",
  "userId": "u456",
  "timestamp": "2026-05-03T10:00:00Z",
  "properties": {
    "amount": 49.99,
    "region": "NA",
    "device": "mobile"
  }
}

Query Metric

GET /api/analytics/query?metric=purchase_count&from=...&to=...&groupBy=region

Create Alert

POST /api/alerts

Request:

{
  "metric": "error_rate",
  "condition": "> 5%",
  "window": "5m",
  "groupBy": ["service"]
}

👉 面试回答

我会提供 event ingestion、metric querying 和 alert rule creation APIs。

Ingestion 需要为高吞吐优化， query APIs 应该从预聚合的 real-time stores 读取。

4️⃣ High-Level Architecture

Client / Service Events
→ Event Collector
→ Kafka / Pulsar / Kinesis
→ Stream Processor
→ Real-time Aggregation Store
→ Query Service
→ Dashboard / Alerting

Raw Events
→ Object Storage
→ Batch Backfill / Replay

Main Components

Event Collector

Validate events
Add server receive time
Light enrichment
Write to durable stream

Durable Stream

Buffer events
Support replay
Handle backpressure
Decouple producers and processors

Stream Processor

Deduplicate
Handle late events
Perform window aggregation
Write aggregates

Real-time Store

Serve dashboard queries
Support time-range and group-by queries

👉 面试回答

我会将 real-time analytics 设计成 streaming pipeline。

Events 写入 durable log， stream processors 持续聚合 events， dashboards 查询 real-time OLAP store。

Raw events 也应该被保存，用于 replay 和 backfill。

5️⃣ Event Ingestion

Ingestion Flow

Event received
→ Validate schema
→ Add received_at
→ Enrich with metadata
→ Write to durable stream
→ Acknowledge producer

Why Durable Stream?

Kafka 这类 durable stream 提供：

High throughput
Replayability
Backpressure handling
Fault tolerance
Consumer isolation

Partitioning

常见 partition keys：

eventName
customerId
userId
tenantId
metricKey

👉 面试回答

Collector 应该只做轻量工作： validation、enrichment，然后写入 durable stream。

事件持久写入后，系统就可以 acknowledge producer。

重计算应该放到 downstream。

6️⃣ Stream Processing

Responsibilities

Parse events
Deduplicate events
Filter invalid events
Join reference data
Aggregate metrics
Handle late events
Emit alerts
Write aggregate results

Tools

例如：

Flink
Spark Structured Streaming
Kafka Streams
Beam
Storm

Why Stream Processing?

因为 dashboards 和 alerts 需要持续更新的结果。

👉 面试回答

Stream processors 从 durable log 消费 events，执行 transformation 和 aggregation，并将结果写入 real-time analytics storage。

这样 dashboards 和 alerts 可以持续更新，不需要等待 batch jobs。

7️⃣ Windowed Aggregation

为什么需要 Windows？

Analytics 通常会问：

How many events happened in the last 5 minutes?

Window Types

Tumbling Window

固定、不重叠窗口。

10:00–10:05
10:05–10:10

Sliding Window

重叠窗口。

last 5 minutes, updated every 1 minute

Session Window

按活动间隔分组。

new session after 30 minutes inactivity

Common Aggregations

count
sum
average
min / max
unique users
percentiles
error rate
conversion rate

👉 面试回答

Windowed aggregation 是 real-time analytics 的核心。

Tumbling windows 简单， sliding windows 适合 rolling metrics， session windows 适合 user behavior analysis。

系统应该根据 query needs 选择 window type。

8️⃣ Late Events and Watermarks

Problem

Events 可能迟到。

示例：

event_time = 10:00
received_at = 10:05

Causes

Mobile offline
Network delay
Client retry
Clock skew
Backpressure

Watermark

Watermark 告诉 processor：

We probably have received most events before time T.

Allowed Lateness

示例：

accept events up to 5 minutes late

👉 面试回答

Real-time analytics 必须处理 late events。

我会同时保存 event time 和 processing time，使用 watermarks，并允许一个有限的 lateness window。

非常迟到的 events 可以进入 correction 或 backfill pipeline。

9️⃣ Deduplication and Exactly-once Effect

为什么会有 Duplicates？

Client retries
Collector retries
Stream processor restarts
At-least-once delivery
Network timeouts

Dedup Key

eventId

或者：

source + eventId

Strategies

Dedup cache
Idempotent writes
支持时使用 exactly-once stream processing
Upsert aggregates
Critical events 存储 processed event IDs

👉 面试回答

实际上 exactly-once delivery 很难。

我会追求 exactly-once effect，使用 event IDs、deduplication、idempotent writes，以及可用时的 transactional stream processing。

对大多数 dashboards，少量 eventual corrections 是可以接受的。

🔟 Real-time Storage / OLAP Store

Requirements

快速 time-range queries
快速 group-by queries
高写入吞吐
Columnar compression
Rollups and aggregations
低延迟 dashboard serving

Options

Apache Pinot
Apache Druid
ClickHouse
Elasticsearch
TimescaleDB
BigQuery BI Engine

Data Model Example

analytics_aggregate (
  metric_name VARCHAR,
  window_start TIMESTAMP,
  window_size VARCHAR,
  dimensions JSON,
  value DOUBLE,
  updated_at TIMESTAMP
)

👉 面试回答

Real-time analytics 需要一个为 OLAP queries 优化的 serving store。

Pinot、Druid 或 ClickHouse 这类系统适合，因为它们支持快速 time-range queries、 group-by aggregations 和高写入吞吐。

1️⃣1️⃣ Raw Events and Backfill

为什么保留 Raw Events？

Reprocess after bug fix
Add new metrics
Backfill dashboards
Audit data
Train models
Correct late data

Storage

Object storage partitioned by date/hour/event type

示例：

s3://events/date=2026-05-03/hour=10/event=purchase/

Replay Flow

Raw events
→ Batch job or replay stream
→ Recompute aggregates
→ Update analytics store

👉 面试回答

我会始终持久保存 raw events。

Real-time pipelines 可能有 bug，所以 raw events 支持 replay、backfill、 metric correction 和创建新 analytics。

1️⃣2️⃣ Query and Dashboard Layer

Dashboard Query Flow

User opens dashboard
→ Query service validates access
→ Reads pre-aggregated data
→ Applies filters/group-by
→ Returns chart data

Query Optimization

Pre-aggregate common metrics
Cache dashboard results
限制 high-cardinality group-bys
长时间范围使用 rollups
可接受时使用 approximate algorithms

Example Queries

Orders per minute by region
Error rate by service over last 15 minutes
Active users in last 5 minutes
P95 latency by endpoint

👉 面试回答

Dashboards 应该查询 pre-aggregated 或 OLAP-optimized data，而不是 raw events。

为了保证 dashboard 快，我会缓存常见 queries，使用 rollups，并限制 high-cardinality group-bys。

1️⃣3️⃣ Alerts

Alert Flow

Stream processor computes metric
→ Rule evaluator checks condition
→ Alert fires
→ Dedup / suppress
→ Notification system

Alert Types

Threshold alert
Anomaly alert
Missing data alert
Rate-of-change alert
Burn-rate alert

Alert Reliability

避免 duplicate alerts
使用 cooldown windows
Group alerts
Track alert state
支持 silence / maintenance windows

👉 面试回答

Real-time analytics 经常用于 alerting。

Alerting system 应该基于 windows 评估 metrics，抑制重复 alerts，支持 cooldown，并将通知 route 给正确 team。

1️⃣4️⃣ High-cardinality Dimensions

Problem

Group-by dimensions 可能爆炸。

示例：

userId
requestId
sessionId
orderId
ipAddress

Impact

High memory usage
Slow queries
Expensive storage
Poor cache hit rate

Strategies

限制 allowed dimensions
预先审批 schemas
使用 approximate aggregations
对低价值 events 采样
High-cardinality drill-down 交给 raw logs
Apply per-tenant quotas

👉 面试回答

High-cardinality dimensions 对 real-time analytics 很危险。

我会限制哪些字段可以作为 dimensions，强制 schema controls，必要时使用 approximations 或 sampling。

Request-level details 通常应该进入 logs，而不是 metrics dashboards。

1️⃣5️⃣ Accuracy vs Latency

Low Latency

优点：

Fresh dashboards
Fast alerting

缺点：

数据更不完整
需要更多 late-event corrections
Compute cost 更高

Higher Accuracy

优点：

Final reports 更准确
Corrections 更少

缺点：

Dashboards 更慢
Alerts 更延迟

Common Strategy

Real-time approximate view now
Batch-corrected accurate view later

👉 面试回答

Real-time analytics 经常用 accuracy 换 freshness。

我会提供快速 near-real-time view，然后用 batch processing 后续修正 late events，生成准确历史 reports。

1️⃣6️⃣ Scaling Patterns

Pattern 1: Durable Event Log

使用 Kafka / Pulsar / Kinesis 吸收流量峰值。

Pattern 2: Partitioned Stream Processing

按 tenant、metric 或 entity 分区。

Pattern 3: Pre-aggregation

Serving query 前预计算常见 metrics。

Pattern 4: Rollups

存储多种分辨率：

1-minute, 5-minute, 1-hour

Pattern 5: Hot / Cold Path

Hot path: stream processing for real-time dashboards
Cold path: batch processing for accuracy and backfill

👉 面试回答

为了扩展 real-time analytics，我会使用 durable event log，对 stream processing 分区，预聚合常见 metrics，存储 rollups，并将 hot real-time processing 和 cold batch correction 分开。

1️⃣7️⃣ Failure Handling

Common Failures

Collector overload
Kafka lag
Stream processor crash
Duplicate events
Late events
OLAP store unavailable
Dashboard query timeout
Bad schema deployment

Strategies

Backpressure
Retry with exponential backoff
Dead-letter queue
Replay from Kafka
Rebuild from raw events
Fallback to cached dashboard
Schema rollback
Data quality alerts

👉 面试回答

系统应该为 replayability 设计。

如果 stream processing 失败，可以从 Kafka replay。

如果 aggregates 错了，可以从 raw events 重建。

如果 dashboards 失败，可以返回 cached 或 degraded data。

1️⃣8️⃣ Consistency Model

需要较强一致性的场景

Access control
Billing-related analytics
Experiment assignment
Critical alert rules
Data deletion / privacy requests

可以最终一致的场景

Dashboards
Aggregated metrics
Funnel reports
Retention reports
Trend analysis
Most alerts with small delay tolerance

👉 面试回答

Real-time analytics 通常是最终一致的。

指标几分钟后被 correction，对很多 dashboards 来说是可以接受的。

但 access control、billing-related metrics、 experiment assignment 和 privacy deletion requests 需要更强正确性。

1️⃣9️⃣ Observability

Key Metrics

Event ingestion QPS
Collector latency
Kafka lag
Stream processing lag
Watermark delay
Late event rate
Duplicate event rate
OLAP write latency
Dashboard query latency
Data freshness
Alert evaluation delay
DLQ event count

👉 面试回答

我会监控 ingestion QPS、queue lag、 stream processing lag、watermark delay、 late event rate、duplicate event rate、 OLAP write latency、dashboard query latency、 data freshness 和 alert delay。

这些指标可以说明 real-time analytics 是否 fresh、correct 和 reliable。

2️⃣0️⃣ End-to-End Flow

Real-time Metric Flow

Event produced
→ Collector validates event
→ Kafka stores event
→ Stream processor aggregates by window
→ OLAP store receives aggregate
→ Dashboard queries latest metric

Backfill Flow

Raw events stored in object storage
→ Batch job reprocesses data
→ Correct aggregates
→ Update historical analytics table

Alert Flow

Stream processor computes metric
→ Alert rule evaluates condition
→ Alert fires
→ Notification sent
→ Alert state tracked

Key Insight

Real-time Analytics 不是简单查询新数据，而是 streaming aggregation、replay、correction 和 serving layers 组成的系统。

🧠 Staff-Level Answer（最终版）

👉 面试回答（完整背诵版）

在设计 Real-time Analytics System 时，我会把它看作一个 streaming aggregation 和 serving platform。

Events 从 clients 和 backend services 收集，由 collectors 进行校验，然后写入 Kafka 这类 durable event log。

Stream processors 会持续消费 events，去重，使用 watermarks 处理 late arrivals，按 time windows 聚合 metrics，并将结果写入 real-time OLAP store。

Dashboards 和 alerting systems 查询 OLAP store，而不是 raw events，因为 dashboards 需要低延迟。

我会将 raw events 保存在 object storage 中，这样当 pipeline 变化或发现 bug 时，系统可以 replay、backfill 和修正 metrics。

Windowed aggregation 是核心。 Tumbling windows 适合固定时间区间， sliding windows 适合 rolling metrics， session windows 适合 user behavior。

High-cardinality dimensions 必须被控制，因为它们会让 storage 和 queries 变得非常昂贵。

Real-time analytics 通常可以最终一致。 Dashboard 可以轻微延迟， late events 可以之后修正。

但 access control、privacy deletion、 experiment assignment 和 billing-related metrics 需要更强正确性。

核心权衡包括 freshness、accuracy、query latency、 compute cost、storage cost 和 operational complexity。

最终目标是为 dashboards 和 alerts 提供 fresh、reliable、low-latency metrics，同时保留 raw data 用于 replay 和 correction。

⭐ Final Insight

Real-time Analytics 的核心不是“查最新数据”，而是一个由 event stream、window aggregation、OLAP serving、replay/backfill 组成的低延迟分析系统。