System Design Deep Dive - 09 Design Metrics System

Post by ailswan May. 02, 2026

🎯 Design Metrics System


1️⃣ Core Framework

When discussing Metrics System design, I frame it as:

  1. Metrics generation and collection
  2. Ingestion pipeline
  3. Time-series data model
  4. Aggregation and rollup
  5. Storage and retention
  6. Query and dashboard serving
  7. Alerting and anomaly detection
  8. Cardinality, cost, and reliability trade-offs

2️⃣ Core Requirements


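Functional Requirements

Collect numerical time-series data from services
Store it efficiently
Support aggregation queries
Power dashboards
Trigger alerts

Non-functional Requirements

High write throughput
Support for high-cardinality labels
Fast time-range queries
Low long-term storage cost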

👉 Interview Answer

A metrics system collects numerical time-series data from services, stores it efficiently, supports aggregation queries, powers dashboards, and triggers alerts.

The main challenge is handling high write throughput, high-cardinality labels, fast time-range queries, and long-term storage cost.


3️⃣ Metrics vs Logs


Logs

Logs are event records.

Example:

{
  "level": "ERROR",
  "message": "payment failed",
  "traceId": "abc"
}

Metrics

Metrics are numerical measurements over time.

Example:

request_count{service="payment", status="500"} = 1
latency_ms{service="payment"} = 123
cpu_usage{host="h1"} = 72%

Key Difference

System    Data Type             Query Pattern
Logs      Detailed events       Search / debug
Metrics   Numeric time series   Aggregate / monitor

👉 Interview Answer

Logs are detailed event records, while metrics are numerical time-series measurements.

Metrics are optimized for aggregation, dashboards, and alerting, while logs are optimized for search and debugging.


4️⃣ Metric Types


Counter

Monotonically increasing value.

Examples:

request_count
error_count
order_created_count

Used for:

Rates such as QPS and error rate


Gauge

Value can go up or down.

Examples:

cpu_usage
memory_usage
queue_depth
active_connections

Histogram

Tracks distribution.

Examples:

request_latency
response_size
db_query_duration

Used for:

p95 and p99 latency


Timer

Measures duration.

Examples:

api_latency_ms
job_runtime_seconds

👉 Interview Answer

I would support common metric types: counters for increasing values, gauges for point-in-time values, and histograms or timers for latency distributions.

Histograms are especially important for calculating p95 and p99 latency.
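To make the percentile point concrete, here is a minimal Python sketch of estimating a quantile from cumulative histogram buckets, using linear interpolation inside the bucket that contains the target rank. The function name `estimate_quantile`, the bucket bounds, and the counts are assumptions made up for illustration; real systems differ in bucket layout and interpolation details.

```python
from bisect import bisect_left


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts.

    upper_bounds: sorted bucket upper bounds, e.g. latency in ms.
    cumulative_counts[i]: number of observations <= upper_bounds[i].
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        raise ValueError("empty histogram")

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = bisect_left(cumulative_counts, rank)
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    below = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - below
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper_bounds[i] - lower) * (rank - below) / in_bucket


# Hypothetical latency histogram: 100 requests in 4 buckets.
bounds = [10.0, 50.0, 100.0, 500.0]
counts = [50, 90, 99, 100]
p95 = estimate_quantile(0.95, bounds, counts)
```

The estimate is only as precise as the bucket boundaries, which is why bucket layout matters for latency metrics.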


5️⃣ Main APIs / Interfaces


Push Metric

POST /api/metrics

Request:

{
  "name": "request_count",
  "value": 1,
  "timestamp": "2026-05-02T10:00:00Z",
  "type": "counter",
  "labels": {
    "service": "payment-service",
    "endpoint": "/checkout",
    "status": "500",
    "region": "us-east-1"
  }
}

Query Metrics

GET /api/metrics/query?metric=request_count&from=...&to=...

Dashboard Query

POST /api/metrics/query

Request:

{
  "metric": "request_latency",
  "aggregation": "p95",
  "groupBy": ["service", "endpoint"],
  "from": "2026-05-02T09:00:00Z",
  "to": "2026-05-02T10:00:00Z"
}

👉 Interview Answer

Metrics can be pushed directly to the system, but in production, metrics are usually collected by agents or sidecars.

The query API should support time ranges, aggregation functions, filters, and group-by dimensions.


6️⃣ Collection Model


Option 1: Push Model

Application → Metrics System

Pros

Works well for short-lived jobs

Cons

Less centralized control than pull


Option 2: Pull Model

Metrics Collector → Scrape Application Endpoint

Example:

GET /metrics

Pros

Centralized control

Cons

Poor fit for short-lived jobs


Option 3: Agent / Sidecar Model

Application → Local Agent → Metrics Pipeline

Pros

Batching
Buffering
Better isolation from application latency


👉 Interview Answer

Metrics can be collected through push, pull, or agent-based models.

Pull-based collection gives centralized control, while push-based collection works better for short-lived jobs.

At large scale, using local agents or sidecars can provide batching, buffering, and better isolation from application latency.
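As a rough illustration of the agent model, the sketch below buffers samples locally and flushes them in batches. The class name `BatchingAgent`, the `send` callback, and the default limits are hypothetical; a real agent would also flush on a background timer and retry failed sends.

```python
import time


class BatchingAgent:
    """Buffers samples and hands them to `send` in batches."""

    def __init__(self, send, max_batch=100, max_wait_s=5.0, clock=time.monotonic):
        self._send = send              # callable that ships one batch downstream
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._clock = clock            # injectable so tests can control time
        self._buffer = []
        self._oldest = None            # arrival time of the oldest buffered sample

    def record(self, sample):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(sample)
        too_big = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._oldest >= self._max_wait_s
        # Note: the age check only runs when a new sample arrives.
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


sent = []
agent = BatchingAgent(sent.append, max_batch=3, clock=lambda: 0.0)
for i in range(7):
    agent.record(i)
agent.flush()  # ship whatever is left, e.g. on shutdown
```

The application only appends to an in-memory list, so a slow metrics backend does not add latency to request handling.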


7️⃣ Time-Series Data Model


Time Series Definition

A time series is identified by:

metric_name + label set

Example:

request_count{
  service="payment",
  endpoint="/checkout",
  status="500"
}

Each data point:

(timestamp, value)

Sample Table Schema

CREATE TABLE metric_sample (
  metric_name VARCHAR,
  labels_hash VARCHAR,
  timestamp TIMESTAMP,
  value DOUBLE,
  labels JSON,
  PRIMARY KEY (metric_name, labels_hash, timestamp)
);

Important Concept: Cardinality

Cardinality = number of unique time series.

Example:

request_count{service, endpoint, status}

If:

100 services × 100 endpoints × 5 status codes
= 50,000 time series

👉 Interview Answer

A metric time series is identified by metric name and label set.

Cardinality is one of the most important design concerns.

If labels include high-cardinality fields such as user ID, request ID, or order ID, the number of time series can explode and make the system extremely expensive.
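The identity and cardinality rules above can be sketched in a few lines of Python. The helper names `series_key` and `worst_case_cardinality` are my own, and the choice of a truncated SHA-256 as the label hash is an assumption; any stable hash over the sorted label set would serve the same purpose.

```python
import hashlib
from math import prod


def series_key(metric_name, labels):
    """Stable identifier for metric name + label set."""
    # Sort labels so the same label set always yields the same key.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    canonical = metric_name + "{" + pairs + "}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def worst_case_cardinality(distinct_values_per_label):
    """Upper bound on series count: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


safe = worst_case_cardinality({"service": 100, "endpoint": 100, "status": 5})
# Adding a hypothetical user_id label with one million distinct values:
exploded = worst_case_cardinality(
    {"service": 100, "endpoint": 100, "status": 5, "user_id": 1_000_000}
)
```

With the three low-cardinality labels the bound is 50,000 series; adding the `user_id` label multiplies it by one million, which is the explosion described above.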


8️⃣ Ingestion Pipeline


Basic Flow

Application / Agent
→ Ingestion Gateway
→ Validation
→ Batching
→ Message Queue
→ Aggregation Workers
→ Time-Series Storage
→ Rollup Storage

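Ingestion Gateway Responsibilities

Validate labels and timestamps
Rate limit bad producers
Write samples to the queue

Aggregation Workers

Responsibilities:

Process samples from the queue
Compute rollups
Write results to time-series storage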

👉 Interview Answer

I would design metrics ingestion as a high-throughput pipeline.

Agents batch and send metrics to ingestion gateways.

Gateways validate labels and timestamps, then write samples to a queue.

Aggregation workers process samples, compute rollups, and write them into time-series storage.
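The gateway's validation step could look roughly like the sketch below, applied to the push payload shape shown earlier. The limits (`MAX_LABELS`, `MAX_CLOCK_SKEW`) and the function name `validate_sample` are assumptions for illustration, not fixed rules.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_TYPES = {"counter", "gauge", "histogram", "timer"}
MAX_LABELS = 10                         # assumed per-sample label limit
MAX_CLOCK_SKEW = timedelta(minutes=5)   # assumed tolerance for future timestamps


def validate_sample(sample, now):
    """Return a list of validation errors; an empty list means accept."""
    errors = []
    if not sample.get("name"):
        errors.append("missing name")
    if sample.get("type") not in ALLOWED_TYPES:
        errors.append("unknown type")
    if not isinstance(sample.get("value"), (int, float)):
        errors.append("non-numeric value")
    if len(sample.get("labels", {})) > MAX_LABELS:
        errors.append("too many labels")
    try:
        # Normalize a trailing "Z" so older Python versions can parse it.
        ts = datetime.fromisoformat(sample["timestamp"].replace("Z", "+00:00"))
        if ts - now > MAX_CLOCK_SKEW:
            errors.append("timestamp in the future")
    except (KeyError, ValueError, TypeError, AttributeError):
        errors.append("bad timestamp")
    return errors


now = datetime(2026, 5, 2, 10, 0, 30, tzinfo=timezone.utc)
ok = validate_sample(
    {
        "name": "request_count",
        "value": 1,
        "timestamp": "2026-05-02T10:00:00Z",
        "type": "counter",
        "labels": {"service": "payment-service", "status": "500"},
    },
    now,
)
```

Rejecting bad samples at the gateway keeps malformed data out of the queue, where it would be more expensive to filter.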


9️⃣ Aggregation and Rollups


Why Rollups?

Raw metrics can be too expensive to keep forever.

Example:

1-second resolution for 1 year = about 31.5 million points per time series

Rollup Examples

Raw: 1 second resolution → keep 7 days
1-minute rollup → keep 30 days
5-minute rollup → keep 6 months
1-hour rollup → keep 2 years

Common Aggregations

sum
count
min / max
avg
percentiles (p95, p99)


Pre-aggregation

Instead of scanning raw data on every query, frequently used aggregates are precomputed, for example:

request_count per service per minute


👉 Interview Answer

Rollups are essential for cost control.

I would store high-resolution raw metrics for a short period, then downsample them into lower-resolution aggregates for long-term retention.

This keeps dashboards and historical queries efficient without storing raw data forever.
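A minimal downsampling sketch, assuming samples arrive as `(unix_timestamp, value)` pairs for a single time series. The function name `rollup` and the choice of stored aggregates are illustrative assumptions.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Downsample (unix_ts, value) points into fixed-width time buckets."""
    buckets = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")}
    )
    for ts, value in points:
        start = ts - ts % bucket_seconds  # align to the bucket's start time
        b = buckets[start]
        b["count"] += 1
        b["sum"] += value
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    return dict(sorted(buckets.items()))


raw = [(0, 1.0), (30, 3.0), (60, 5.0), (119, 7.0), (120, 9.0)]
one_minute = rollup(raw, 60)
avg_first_minute = one_minute[0]["sum"] / one_minute[0]["count"]
```

Storing `sum` and `count` rather than a precomputed average is a deliberate choice here: coarser rollups (for example 5-minute from 1-minute) can then be derived by adding the parts, whereas averages of averages would be wrong for uneven buckets. Exact percentiles cannot be merged this way, which is one reason histograms are kept as bucket counts.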


🔟 Storage Design


Hot Storage

Used for recent high-resolution data.

Requirements:

High write throughput
Low-latency queries over recent time ranges

Examples:

Time-series DB
ClickHouse
M3DB
VictoriaMetrics
Prometheus TSDB

Cold Storage

Used for long-term rollups.

Examples:

Object storage
Columnar files
Compressed archives

Partitioning

Partition by:

time bucket
tenant
metric name
label hash

Compression

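Time-series data compresses well because:

Timestamps usually arrive at regular intervals
Consecutive values of a series are often close to each other
The label set is identical for every sample of a series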


👉 Interview Answer

I would use a time-series optimized storage engine.

Recent high-resolution metrics go to hot storage, while older rolled-up data can be moved to cheaper storage.

Partitioning by time, tenant, metric name, and label hash helps both ingestion and query performance.
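One way to combine those partitioning dimensions is sketched below. The function name `partition_key`, the one-hour bucket, the shard count, and the use of CRC32 are all assumptions chosen for illustration.

```python
import zlib


def partition_key(tenant, metric_name, labels_hash, unix_ts,
                  bucket_seconds=3600, shards=16):
    """Map a sample to (tenant, time bucket, shard)."""
    # All samples in the same hour share a time bucket, so old buckets
    # can be dropped or moved to cold storage as a unit.
    time_bucket = unix_ts - unix_ts % bucket_seconds
    # Hash metric name + label hash so one series always lands on the
    # same shard while different series spread across shards.
    shard = zlib.crc32(f"{metric_name}:{labels_hash}".encode("utf-8")) % shards
    return (tenant, time_bucket, shard)


ts = 1_777_716_000  # 2026-05-02T10:00:00Z
key_a = partition_key("tenant-1", "request_count", "abc123", ts)
key_b = partition_key("tenant-1", "request_count", "abc123", ts + 10)
key_c = partition_key("tenant-1", "request_count", "abc123", ts + 3600)
```

Here `key_a` and `key_b` are identical because both samples fall in the same hour, while `key_c` differs only in its time bucket.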


1️⃣1️⃣ Query System


Common Queries

QPS by service over last 1 hour
p95 latency by endpoint
error rate by region
CPU usage by host
queue depth over time

Query Flow

User dashboard query
→ Query API
→ Permission check
→ Query planner
→ Select time partitions
→ Fetch raw or rollup data
→ Aggregate / group by
→ Return time series

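Query Optimization

Choose raw or rollup data based on the query range
Limit high-cardinality group-bys
Cache common dashboard queries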


👉 Interview Answer

Metrics queries are usually time-range aggregation queries.

The query engine should choose the right resolution: raw data for recent short windows, and rollup data for longer historical ranges.

To keep queries fast, I would limit high-cardinality group-bys and cache common dashboard queries.
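Resolution selection can be sketched as a small planner function. The tiers below reuse the rollup example from the aggregation section (treating 6 months as 180 days and 2 years as 730 days), and `max_points` is an assumed budget for how many points per series a dashboard panel should receive.

```python
DAY = 86_400

# (step in seconds, retention in seconds), finest resolution first.
RESOLUTIONS = [
    (1, 7 * DAY),
    (60, 30 * DAY),
    (300, 180 * DAY),
    (3600, 730 * DAY),
]


def choose_step(now, start, end, max_points=1500):
    """Pick the finest resolution that still holds `start` and fits the budget."""
    covering = [step for step, keep in RESOLUTIONS if now - start <= keep]
    if not covering:
        raise ValueError("range starts before the longest retention window")
    for step in covering:
        if (end - start) / step <= max_points:
            return step
    # Nothing fits the point budget: fall back to the coarsest available tier.
    return covering[-1]


now = 2_000_000_000
last_10_minutes = choose_step(now, now - 600, now)
last_24_hours = choose_step(now, now - DAY, now)
last_90_days = choose_step(now, now - 90 * DAY, now)
```

A ten-minute window is served from raw data, a 24-hour window from 1-minute rollups, and a 90-day window from 1-hour rollups because raw and 1-minute data for the start of the range have already expired.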


1️⃣2️⃣ Alerting System


Alert Rule Example

error_rate{service="payment"} > 5% for 5 minutes

Alert Flow

Metric stream / query
→ Rule evaluator
→ Condition matched
→ Dedup / grouping
→ Notification system
→ On-call user

Alerting Design

Deduplication
Grouping
Silence windows
Routing rules


👉 Interview Answer

Alerting can be built on top of metrics.

A rule evaluator periodically checks metric conditions, such as error rate or latency thresholds.

To avoid noisy alerts, the system should support deduplication, grouping, silence windows, and routing rules.
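The "condition must hold for 5 minutes" behavior, plus the simplest form of deduplication, can be sketched as a small state machine. The class name `ThresholdRule` and its return values are assumptions for illustration; grouping, silences, and routing are left out.

```python
class ThresholdRule:
    """Fires once when value stays above threshold for `for_seconds`."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self._breach_start = None  # when the current breach began
        self._firing = False       # True while an alert is active

    def evaluate(self, ts, value):
        """Return "fire", "resolve", or None for one evaluation tick."""
        if value > self.threshold:
            if self._breach_start is None:
                self._breach_start = ts
            held_long_enough = ts - self._breach_start >= self.for_seconds
            if held_long_enough and not self._firing:
                self._firing = True
                return "fire"
            return None  # still pending, or already firing (deduplicated)
        # Condition no longer holds: reset the pending timer.
        self._breach_start = None
        if self._firing:
            self._firing = False
            return "resolve"
        return None


# error_rate > 5% for 5 minutes, evaluated once per minute.
rule = ThresholdRule(threshold=0.05, for_seconds=300)
ticks = [(0, 0.06), (60, 0.07), (300, 0.08), (360, 0.09), (420, 0.01), (480, 0.06)]
events = [rule.evaluate(ts, value) for ts, value in ticks]
```

The rule emits a single "fire" at the 300-second tick, stays silent while the breach continues, and emits "resolve" once the error rate drops back under the threshold.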


1️⃣3️⃣ Cardinality Control


High-cardinality Labels

Dangerous labels:

user_id
request_id
session_id
order_id
email
ip_address

Why Dangerous?

They create too many unique time series, which makes the system extremely expensive.


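Strategies

Reject high-cardinality labels such as user ID or request ID
Send request-level detail to logs or traces
Keep metrics aggregated and low-cardinality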


👉 Interview Answer

Cardinality control is one of the hardest parts of metrics systems.

I would prevent high-cardinality labels like user ID or request ID from being used as metric labels.

Request-level detail should usually go to logs or traces, while metrics should stay aggregated and low-cardinality.
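A sketch of how such a guard might be enforced at ingestion time, assuming a blocklist of label names plus a per-metric cap on distinct series. The class name `CardinalityGuard` and the in-memory set are illustrative; a production system would more likely track this per tenant with approximate counters.

```python
BLOCKED_LABELS = {"user_id", "request_id", "session_id", "order_id", "email", "ip_address"}


class CardinalityGuard:
    """Rejects blocked label names and caps distinct series per metric."""

    def __init__(self, max_series_per_metric):
        self.max_series = max_series_per_metric
        self._seen = {}  # metric name -> set of label tuples already admitted

    def admit(self, metric, labels):
        if BLOCKED_LABELS.intersection(labels):
            return False  # a dangerous label name is present
        series = tuple(sorted(labels.items()))
        seen = self._seen.setdefault(metric, set())
        if series in seen:
            return True   # existing series: always allowed
        if len(seen) >= self.max_series:
            return False  # new series would exceed the cap
        seen.add(series)
        return True


guard = CardinalityGuard(max_series_per_metric=2)
results = [
    guard.admit("request_count", {"service": "a"}),
    guard.admit("request_count", {"service": "b"}),
    guard.admit("request_count", {"service": "c"}),   # over the cap
    guard.admit("request_count", {"service": "a"}),   # already known
    guard.admit("request_count", {"user_id": "42"}),  # blocked label
]
```

Existing series keep flowing when the cap is hit; only new label combinations are refused, which limits the damage from a bad deploy without dropping established dashboards.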


1️⃣4️⃣ Retention and Cost Control


Retention Policy Example

Raw 10-second metrics: 7 days
1-minute rollups: 30 days
5-minute rollups: 6 months
1-hour rollups: 2 years

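Cost Control Strategies

Rollups
Compression
Label cardinality limits
Per-tenant quotas
Different retention policies by resolution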


👉 Interview Answer

Metrics systems can become expensive because every label combination creates a new time series.

I would control cost using rollups, compression, label cardinality limits, per-tenant quotas, and different retention policies by resolution.
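A quick back-of-the-envelope calculation shows why tiered retention matters. It uses the retention policy example above and assumes 6 months is 180 days and 2 years is 730 days; the variable names are mine.

```python
DAY = 86_400

# (name, step in seconds, retention in days) from the retention policy example.
TIERS = [
    ("raw 10-second", 10, 7),
    ("1-minute rollup", 60, 30),
    ("5-minute rollup", 300, 180),
    ("1-hour rollup", 3600, 730),
]


def points_per_series(step_seconds, retention_days):
    """Number of stored points for one series at a given resolution."""
    return retention_days * DAY // step_seconds


tiered_points = sum(points_per_series(step, days) for _, step, days in TIERS)
raw_only_points = points_per_series(10, 730)  # keeping raw data for 2 years
savings_factor = raw_only_points / tiered_points
```

Under these assumptions each series stores 173,040 points across all tiers, versus 6,307,200 points if raw 10-second data were kept for two years, roughly a 36x reduction before compression.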


1️⃣5️⃣ Reliability and Failure Handling


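Common Failures

Traffic spikes
Misbehaving producers
Overloaded ingestion or storage

Strategies

Buffer locally in agents
Rate limit bad producers at the ingestion gateway
Absorb temporary spikes with queues
Drop low-priority metrics before impacting application traffic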


👉 Interview Answer

Metrics systems should handle overload gracefully.

Agents can buffer locally, ingestion gateways can rate limit bad producers, and queues can absorb temporary spikes.

If the system is overloaded, it is usually better to drop low-priority metrics than to impact application traffic.
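The "drop low-priority metrics first" policy can be sketched as a bounded buffer with two priority classes. The class name `PriorityBuffer` and the two-level priority scheme are assumptions for illustration.

```python
from collections import deque


class PriorityBuffer:
    """Bounded buffer that sheds low-priority samples first when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.high = deque()
        self.low = deque()
        self.dropped = 0  # exported as a metric so load shedding is visible

    def offer(self, sample, high_priority=False):
        if len(self.high) + len(self.low) >= self.capacity:
            self.dropped += 1
            if self.low:
                self.low.popleft()    # evict the oldest low-priority sample
            elif high_priority:
                self.high.popleft()   # only high left: evict the oldest one
            else:
                return False          # refuse the incoming low-priority sample
        (self.high if high_priority else self.low).append(sample)
        return True


buf = PriorityBuffer(capacity=3)
buf.offer("debug_gauge_1")
buf.offer("error_count_1", high_priority=True)
buf.offer("debug_gauge_2")
buf.offer("error_count_2", high_priority=True)   # evicts debug_gauge_1
buf.offer("error_count_3", high_priority=True)   # evicts debug_gauge_2
accepted = buf.offer("debug_gauge_3")            # refused: buffer is all high-priority
```

The buffer never blocks the caller, so an overloaded metrics pipeline degrades by losing less important samples rather than by slowing down the application.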


1️⃣6️⃣ Consistency Model


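Stronger Consistency Needed For

Alert configuration
Access control
Tenant quotas
Billing-related metrics

Eventual Consistency Acceptable For

Dashboard data (a delay of a few seconds is acceptable)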


👉 Interview Answer

Metrics systems usually accept eventual consistency.

It is acceptable if dashboard data is delayed by a few seconds.

However, alert configuration, access control, and billing-related metrics need stronger correctness.


1️⃣7️⃣ Security and Access Control


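Requirements

Tenant isolation
Role-based access control
Encryption
Label validation to keep PII out of metric labels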


👉 Interview Answer

Metrics may contain sensitive operational information, so access control is important.

I would enforce tenant isolation, role-based access control, encryption, and label validation to prevent PII from entering metric labels.


1️⃣8️⃣ Observability of the Metrics System


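Key Metrics

Ingestion throughput
Queue lag
Dropped samples
Active time-series count
Cardinality growth
Query latency
Alert evaluation delay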


👉 Interview Answer

The metrics system itself must be observable.

I would monitor ingestion throughput, queue lag, dropped samples, active time-series count, cardinality growth, query latency, and alert evaluation delay.


1️⃣9️⃣ End-to-End Flow


Ingestion Flow

Application emits metric
→ Agent batches metrics
→ Ingestion gateway validates labels
→ Queue buffers samples
→ Aggregation workers compute rollups
→ Time-series storage writes samples

Query Flow

User opens dashboard
→ Query service checks permission
→ Select raw or rollup data
→ Execute time-range aggregation
→ Return time-series result

Alert Flow

Rule evaluator checks metric
→ Condition true for duration
→ Deduplicate alert
→ Notification system sends alert
→ On-call user notified

Key Insight

A metrics system is an aggregation-first time-series system, not a general event search system.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a metrics system, I think of it as a high-throughput time-series data platform for monitoring, dashboards, and alerting.

Applications emit numerical measurements such as counters, gauges, histograms, and timers.

These metrics are collected by agents or scrapers, batched, validated, and sent through an ingestion pipeline.

A time series is identified by metric name plus labels. Cardinality is a major design concern, because each unique label combination creates a new time series.

I would prevent high-cardinality labels such as user ID, request ID, or order ID from being used as metric labels.

For ingestion, I would use gateways, queues, aggregation workers, and time-series storage.

Recent high-resolution data goes to hot storage, while older data is downsampled into rollups and stored more cheaply.

Query serving should choose the right resolution: raw data for recent short windows, and rollups for long historical windows.

Alerting is built on top of metrics, with rule evaluation, deduplication, grouping, silence windows, and routing.

The main trade-offs are ingestion throughput, query latency, storage cost, retention, cardinality, and consistency.

Metrics can usually be eventually consistent, but alert rules, access control, tenant quotas, and billing-related metrics need stronger correctness.

Ultimately, the goal is to provide reliable, cost-efficient, low-latency visibility into system health and performance at scale.


⭐ Final Insight

The core of a metrics system is not storing every individual event, but efficiently storing, aggregating, and querying large-scale time-series data.



Chinese Version


🎯 Design Metrics System


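1️⃣ Core Framework

When designing a Metrics System, I usually analyze it from the following aspects:

  1. Metrics generation and collection
  2. Ingestion pipeline
  3. Time-series data model
  4. Aggregation and rollup
  5. Storage and retention
  6. Query and dashboard serving
  7. Alerting and anomaly detection
  8. Cardinality, cost, and reliability trade-offs

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

A metrics system collects numerical time-series data from services, stores it efficiently, supports aggregation queries, feeds dashboards, and triggers alerts.

The core challenge is handling high write throughput, high-cardinality labels, fast time-range queries, and long-term storage cost.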


3️⃣ Metrics vs Logs


Logs

Logs are event records.

Example:

{
  "level": "ERROR",
  "message": "payment failed",
  "traceId": "abc"
}

Metrics

Metrics are numerical measurements over time.

Example:

request_count{service="payment", status="500"} = 1
latency_ms{service="payment"} = 123
cpu_usage{host="h1"} = 72%

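Key Difference

System    Data Type             Query Pattern
Logs      Detailed events       Search / debug
Metrics   Numeric time series   Aggregate / monitor

👉 Interview Answer

Logs are detailed event records, while metrics are numerical time-series measurements.

Metrics are optimized mainly for aggregation, dashboards, and alerting, while logs are optimized mainly for search and debugging.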


4️⃣ Metric Types


Counter

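Monotonically increasing value.

Examples:

request_count
error_count
order_created_count

Used for:


Gauge

Value can go up or down.

Examples:

cpu_usage
memory_usage
queue_depth
active_connections

Histogram

Tracks distribution.

Examples:

request_latency
response_size
db_query_duration

Used for:


Timer

Measures duration.

Examples:

api_latency_ms
job_runtime_seconds

👉 Interview Answer

I would support the common metric types: counters for monotonically increasing values, gauges for point-in-time state values, and histograms or timers for latency distributions.

Histograms are especially important for calculating p95 and p99 latency.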


5️⃣ Main APIs / Interfaces


Push Metric

POST /api/metrics

Request:

{
  "name": "request_count",
  "value": 1,
  "timestamp": "2026-05-02T10:00:00Z",
  "type": "counter",
  "labels": {
    "service": "payment-service",
    "endpoint": "/checkout",
    "status": "500",
    "region": "us-east-1"
  }
}

Query Metrics

GET /api/metrics/query?metric=request_count&from=...&to=...

Dashboard Query

POST /api/metrics/query

Request:

{
  "metric": "request_latency",
  "aggregation": "p95",
  "groupBy": ["service", "endpoint"],
  "from": "2026-05-02T09:00:00Z",
  "to": "2026-05-02T10:00:00Z"
}

👉 Interview Answer

Metrics can be pushed directly to the system, but in production they are usually collected by agents or sidecars.

The query API should support time ranges, aggregation functions, filters, and group-by dimensions.


6️⃣ Collection Model


Option 1: Push Model

Application → Metrics System

Pros

Cons


Option 2: Pull Model

Metrics Collector → Scrape Application Endpoint

Example:

GET /metrics

Pros

Cons


Option 3: Agent / Sidecar Model

Application → Local Agent → Metrics Pipeline

Pros


👉 Interview Answer

Metrics can be collected through push, pull, or agent-based models.

Pull-based collection gives stronger centralized control, while push-based collection is a better fit for short-lived jobs.

In large-scale systems, local agents or sidecars can provide batching and buffering, and reduce the impact on application latency.


7️⃣ Time-Series Data Model


Time Series Definition

A time series is uniquely identified by:

metric_name + label set

Example:

request_count{
  service="payment",
  endpoint="/checkout",
  status="500"
}

Each data point:

(timestamp, value)

Sample Table Schema

CREATE TABLE metric_sample (
  metric_name VARCHAR,
  labels_hash VARCHAR,
  timestamp TIMESTAMP,
  value DOUBLE,
  labels JSON,
  PRIMARY KEY (metric_name, labels_hash, timestamp)
);

Important Concept: Cardinality

Cardinality = number of unique time series.

Example:

request_count{service, endpoint, status}

If:

100 services × 100 endpoints × 5 status codes
= 50,000 time series

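👉 Interview Answer

A metric time series is identified by metric name and label set.

Cardinality is one of the most important design concerns in a metrics system.

If labels include high-cardinality fields such as user ID, request ID, or order ID, the number of time series explodes and the system becomes very expensive.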


8️⃣ Ingestion Pipeline


Basic Flow

Application / Agent
→ Ingestion Gateway
→ Validation
→ Batching
→ Message Queue
→ Aggregation Workers
→ Time-Series Storage
→ Rollup Storage

Ingestion Gateway Responsibilities


Aggregation Workers Responsibilities


👉 Interview Answer

I would design metrics ingestion as a high-throughput pipeline.

Agents send metrics to ingestion gateways in batches.

Gateways validate labels and timestamps, then write samples to a queue.

Aggregation workers process samples, compute rollups, and write them into time-series storage.


9️⃣ Aggregation and Rollups


Why Rollups?

Raw metrics are very expensive to keep long-term.

Example:

1-second resolution for 1 year is very expensive

Rollup Examples

Raw: 1 second resolution → keep 7 days
1-minute rollup → keep 30 days
5-minute rollup → keep 6 months
1-hour rollup → keep 2 years

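Common Aggregations


Pre-aggregation

Instead of scanning raw data on every query, aggregates such as:

request_count per service per minute

can be precomputed.


👉 Interview Answer

Rollups are very important for cost control.

I would keep high-resolution raw metrics for a short period, then downsample them into lower-resolution aggregates for long-term retention.

This keeps dashboards and historical queries efficient while avoiding storing raw data forever.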


🔟 Storage Design


Hot Storage

Used for recent high-resolution data.

Requirements:

Examples:

Time-series DB
ClickHouse
M3DB
VictoriaMetrics
Prometheus TSDB

Cold Storage

Used for long-term rollups.

Examples:

Object storage
Columnar files
Compressed archives

Partitioning

Partition by:

time bucket
tenant
metric name
label hash

Compression

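Time-series data compresses well because:


👉 Interview Answer

I would use a time-series optimized storage engine.

Recent high-resolution metrics live in hot storage, while older data can be moved to cheaper storage after being rolled up.

Partitioning by time, tenant, metric name, and label hash helps both ingestion and query performance.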


1️⃣1️⃣ Query System


Common Queries

QPS by service over last 1 hour
p95 latency by endpoint
error rate by region
CPU usage by host
queue depth over time

Query Flow

User dashboard query
→ Query API
→ Permission check
→ Query planner
→ Select time partitions
→ Fetch raw or rollup data
→ Aggregate / group by
→ Return time series

Query Optimization


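👉 Interview Answer

Metrics queries are usually time-range aggregation queries.

The query engine should choose the right data resolution: raw data for short time windows, and rollup data for long historical queries.

To keep queries fast, I would limit high-cardinality group-bys and cache common dashboard queries.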


1️⃣2️⃣ Alerting System


Alert Rule Example

error_rate{service="payment"} > 5% for 5 minutes

Alert Flow

Metric stream / query
→ Rule evaluator
→ Condition matched
→ Dedup / grouping
→ Notification system
→ On-call user

Alerting Design


👉 Interview Answer

Alerting can be built on top of metrics.

A rule evaluator periodically checks metric conditions, such as whether error rate or latency exceeds a threshold.

To avoid noisy alerts, the system should support deduplication, grouping, silence windows, and routing rules.


1️⃣3️⃣ Cardinality Control


High-cardinality Labels

Dangerous labels:

user_id
request_id
session_id
order_id
email
ip_address

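Why Dangerous?

They create too many unique time series.

This causes:


Strategies


👉 Interview Answer

Cardinality control is one of the hardest problems in a metrics system.

I would prevent high-cardinality fields such as user ID or request ID from being used as metric labels.

Request-level detail should usually go to logs or traces, while metrics should stay aggregated and low-cardinality.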


1️⃣4️⃣ Retention and Cost Control


Retention Policy Example

Raw 10-second metrics: 7 days
1-minute rollups: 30 days
5-minute rollups: 6 months
1-hour rollups: 2 years

Cost Control Strategies


👉 Interview Answer

A metrics system can be expensive because every label combination creates a new time series.

I would control cost using rollups, compression, label cardinality limits, per-tenant quotas, and different retention policies by resolution.


1️⃣5️⃣ Reliability and Failure Handling


Common Failures


Strategies


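👉 Interview Answer

A metrics system should handle overload gracefully.

Agents can buffer locally, ingestion gateways can rate limit misbehaving producers, and queues can absorb temporary traffic spikes.

If the system is overloaded, it is usually better to drop low-priority metrics than to affect the application's own traffic.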


1️⃣6️⃣ Consistency Model


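Stronger Consistency Needed For


Eventual Consistency Acceptable For


👉 Interview Answer

A metrics system can usually accept eventual consistency.

It is usually acceptable for dashboard data to be delayed by a few seconds.

However, alert configuration, access control, tenant quotas, and billing-related metrics need stronger correctness.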


1️⃣7️⃣ Security and Access Control


Requirements


👉 Interview Answer

Metrics may contain sensitive operational information, so access control is important.

I would enforce tenant isolation, RBAC, and encryption, and validate labels to prevent PII from being written into metric labels.


1️⃣8️⃣ Observability of the Metrics System


Key Metrics


👉 Interview Answer

The metrics system itself must also be observable.

I would monitor ingestion throughput, queue lag, dropped samples, active time-series count, cardinality growth, query latency, and alert evaluation delay.


1️⃣9️⃣ End-to-End Flow


Ingestion Flow

Application emits metric
→ Agent batches metrics
→ Ingestion gateway validates labels
→ Queue buffers samples
→ Aggregation workers compute rollups
→ Time-series storage writes samples

Query Flow

User opens dashboard
→ Query service checks permission
→ Select raw or rollup data
→ Execute time-range aggregation
→ Return time-series result

Alert Flow

Rule evaluator checks metric
→ Condition true for duration
→ Deduplicate alert
→ Notification system sends alert
→ On-call user notified

Key Insight

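A metrics system is an aggregation-first time-series system, not a general event search system.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a metrics system, I think of it as a high-throughput time-series data platform for monitoring, dashboards, and alerting.

Applications emit numerical metrics such as counters, gauges, histograms, and timers.

These metrics are collected by agents or scrapers, then batched and validated before entering the ingestion pipeline.

A time series is uniquely identified by metric name plus labels. Cardinality is a core design concern, because each distinct label combination creates a new time series.

I would forbid high-cardinality fields such as user ID, request ID, or order ID from being used as metric labels.

For ingestion, I would use gateways, queues, aggregation workers, and time-series storage.

Recent high-resolution data goes to hot storage, while older data is downsampled into rollups and stored more cheaply.

Query serving should choose the right data resolution: raw data for short time windows, and rollups for long historical windows.

Alerting is built on top of metrics, including rule evaluation, deduplication, grouping, silence windows, and routing.

The core trade-offs are ingestion throughput, query latency, storage cost, retention, cardinality, and consistency.

Metrics can usually be eventually consistent, but alert rules, access control, tenant quotas, and billing-related metrics need stronger correctness.

The ultimate goal is to provide visibility into system health and performance at scale in a reliable, low-cost, low-latency way.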


⭐ Final Insight

The core of a metrics system is not storing every individual event, but efficiently storing, aggregating, and querying large-scale time-series data.
