d&d-t System Design Deep Dive

🎯 Design Metrics System

1️⃣ Core Framework

When discussing Metrics System design, I frame it as:

Metrics generation and collection
Ingestion pipeline
Time-series data model
Aggregation and rollup
Storage and retention
Query and dashboard serving
Alerting and anomaly detection
Cardinality, cost, and reliability trade-offs

2️⃣ Core Requirements

Functional Requirements

Applications can emit metrics
Support counters, gauges, histograms, and timers
Support tags / labels
Support time-series queries
Support dashboards
Support alerting rules
Support aggregation and rollups
Support retention policies

Non-functional Requirements

High write throughput
Low ingestion latency
Efficient time-series storage
Fast range queries
High availability
Cost-efficient retention
Eventually consistent metrics are acceptable

👉 Interview Answer

A metrics system collects numerical time-series data from services, stores it efficiently, supports aggregation queries, powers dashboards, and triggers alerts.

The main challenge is handling high write throughput, high-cardinality labels, fast time-range queries, and long-term storage cost.

3️⃣ Metrics vs Logs

Logs

Logs are event records.

Example:

{
  "level": "ERROR",
  "message": "payment failed",
  "traceId": "abc"
}

Metrics

Metrics are numerical measurements over time.

Example:

request_count{service="payment", status="500"} = 1
latency_ms{service="payment"} = 123
cpu_usage{host="h1"} = 72%

Key Difference

System	Data Type	Query Pattern
Logs	Detailed events	Search / debug
Metrics	Numeric time series	Aggregate / monitor

👉 Interview Answer

Logs are detailed event records, while metrics are numerical time-series measurements.

Metrics are optimized for aggregation, dashboards, and alerting, while logs are optimized for search and debugging.

4️⃣ Metric Types

Counter

Monotonically increasing value.

Examples:

request_count
error_count
order_created_count

Used for:

QPS
Error rate
Throughput
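
As a rough illustration of how QPS falls out of a counter, the sketch below derives a per-second rate from successive counter samples. The function name `counter_rate` and the reset rule (a decrease means the process restarted) are assumptions made for this example, not something the notes prescribe.

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as the increase for that step.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# 20 + 15 + 25 = 60 requests over 30 seconds, with a restart between t=10 and t=20.
samples = [(0, 100), (10, 120), (20, 15), (30, 40)]
print(counter_rate(samples))  # 2.0
```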

Gauge

Value can go up or down.

Examples:

cpu_usage
memory_usage
queue_depth
active_connections

Histogram

Tracks distribution.

Examples:

request_latency
response_size
db_query_duration

Used for:

p50 / p95 / p99 latency
Distribution analysis

Timer

Measures duration.

Examples:

api_latency_ms
job_runtime_seconds

👉 Interview Answer

I would support common metric types: counters for increasing values, gauges for point-in-time values, and histograms or timers for latency distributions.

Histograms are especially important for calculating p95 and p99 latency.
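
To make the percentile point concrete, here is a minimal fixed-bucket histogram. The class name, the bucket bounds, and the choice to report the bucket's upper bound (rather than interpolating inside the bucket) are all assumptions of this sketch.

```python
import bisect


class Histogram:
    """Fixed-bucket histogram; each bound is an inclusive upper limit."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket

    def observe(self, value):
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.counts)
        if total == 0:
            return None
        rank = q * total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self.bounds[i] if i < len(self.bounds) else float("inf")


h = Histogram([10, 50, 100, 500])           # latency buckets in ms
for latency in [5] * 90 + [80] * 8 + [400] * 2:
    h.observe(latency)
print(h.quantile(0.50), h.quantile(0.95), h.quantile(0.99))  # 10 100 500
```

Because bucket counts are plain integers, histograms from different hosts can be merged by adding counts, which is one reason histograms aggregate better than precomputed percentiles.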

5️⃣ Main APIs / Interfaces

Push Metric

POST /api/metrics

Request:

{
  "name": "request_count",
  "value": 1,
  "timestamp": "2026-05-02T10:00:00Z",
  "type": "counter",
  "labels": {
    "service": "payment-service",
    "endpoint": "/checkout",
    "status": "500",
    "region": "us-east-1"
  }
}

Query Metrics

GET /api/metrics/query?metric=request_count&from=...&to=...

Dashboard Query

POST /api/metrics/query

Request:

{
  "metric": "request_latency",
  "aggregation": "p95",
  "groupBy": ["service", "endpoint"],
  "from": "2026-05-02T09:00:00Z",
  "to": "2026-05-02T10:00:00Z"
}

👉 Interview Answer

Metrics can be pushed directly to the system, but in production, metrics are usually collected by agents or sidecars.

The query API should support time ranges, aggregation functions, filters, and group-by dimensions.
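
One way a gateway might turn the push payload above into a typed record is sketched below. The `Sample` dataclass and `parse_sample` helper are hypothetical names, and the set of accepted types simply mirrors the four metric types listed earlier.

```python
import json
from dataclasses import dataclass
from datetime import datetime

ALLOWED_TYPES = {"counter", "gauge", "histogram", "timer"}


@dataclass(frozen=True)
class Sample:
    name: str
    value: float
    timestamp: datetime
    type: str
    labels: tuple  # sorted (key, value) pairs, so equal label sets compare equal


def parse_sample(body: str) -> Sample:
    raw = json.loads(body)
    if raw["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown metric type: {raw['type']}")
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+, so normalize it.
    ts = datetime.fromisoformat(raw["timestamp"].replace("Z", "+00:00"))
    labels = tuple(sorted(raw.get("labels", {}).items()))
    return Sample(raw["name"], float(raw["value"]), ts, raw["type"], labels)
```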

6️⃣ Collection Model

Option 1: Push Model

Application → Metrics System

Pros

No scrape endpoint or target discovery is needed
Works well for short-lived jobs

Cons

Clients must handle batching, retries, and backoff themselves
Harder to control ingestion load

Option 2: Pull Model

Metrics Collector → Scrape Application Endpoint

Example:

GET /metrics

Pros

Centralized control
A failed scrape doubles as a health signal for the target
Common in Prometheus-style systems

Cons

Collector must discover targets
Harder for short-lived jobs

Option 3: Agent / Sidecar Model

Application → Local Agent → Metrics Pipeline

Pros

Local buffering
Batching
Reduces app complexity
Good for high scale

👉 Interview Answer

Metrics can be collected through push, pull, or agent-based models.

Pull-based collection gives centralized control, while push-based collection works better for short-lived jobs.

At large scale, using local agents or sidecars can provide batching, buffering, and better isolation from application latency.
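
The batching and buffering behaviour of a local agent could look roughly like the sketch below. `LocalAgent`, its parameters, and the use of `ConnectionError` to signal a failed send are assumptions for illustration; a real agent would also flush on a timer and compress batches.

```python
from collections import deque


class LocalAgent:
    def __init__(self, send, batch_size=3, max_buffered=1000):
        self.send = send                          # callable(list_of_samples); raises on failure
        self.batch_size = batch_size
        self.buffer = deque(maxlen=max_buffered)  # when full, the oldest samples fall off

    def record(self, sample):
        """Called on the application's hot path: append only, no network I/O."""
        self.buffer.append(sample)

    def flush(self):
        """Called from a background loop; returns the number of samples delivered."""
        sent = 0
        while self.buffer:
            size = min(self.batch_size, len(self.buffer))
            batch = [self.buffer.popleft() for _ in range(size)]
            try:
                self.send(batch)
            except ConnectionError:
                self.buffer.extendleft(reversed(batch))  # keep the batch for the next flush
                break
            sent += len(batch)
        return sent
```

Keeping `record` free of I/O is what isolates application latency from the metrics pipeline: a slow or unavailable gateway only affects the background flush.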

7️⃣ Time-Series Data Model

Time Series Definition

A time series is identified by:

metric_name + label set

Example:

request_count{
  service="payment",
  endpoint="/checkout",
  status="500"
}

Each data point:

(timestamp, value)

Sample Row

CREATE TABLE metric_sample (
  metric_name VARCHAR,
  labels_hash VARCHAR,   -- hash of the sorted label set
  timestamp TIMESTAMP,
  value DOUBLE,
  labels JSON,           -- full label set, kept for filtering and group-by
  PRIMARY KEY (metric_name, labels_hash, timestamp)
);
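
The `labels_hash` column has to be stable regardless of the order in which labels arrive. A minimal sketch of one way to compute it follows; the separator byte, SHA-256, and the 16-character truncation are assumptions of this example.

```python
import hashlib


def labels_hash(labels: dict) -> str:
    """Stable hash of a label set, independent of key order."""
    canonical = "\x00".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def series_key(metric_name: str, labels: dict) -> tuple:
    """Identity of one time series: metric name plus label-set hash."""
    return (metric_name, labels_hash(labels))


a = series_key("request_count", {"service": "payment", "status": "500"})
b = series_key("request_count", {"status": "500", "service": "payment"})
print(a == b)  # True: same series despite different key order
```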

Important Concept: Cardinality

Cardinality = number of unique time series.

Example:

request_count{service, endpoint, status}

If:

100 services × 100 endpoints × 5 status codes
= 50,000 time series

👉 Interview Answer

A metric time series is identified by metric name and label set.

Cardinality is one of the most important design concerns.

If labels include high-cardinality fields such as user ID, request ID, or order ID, the number of time series can explode and make the system extremely expensive.
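
The explosion is easy to quantify. The sketch below computes the worst-case series count as the product of distinct values per label; the figure of one million users is an assumed number chosen only for illustration.

```python
from math import prod


def series_count(label_value_counts: dict) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


base = {"service": 100, "endpoint": 100, "status": 5}
print(series_count(base))                            # 50000
print(series_count({**base, "user_id": 1_000_000}))  # 50000000000
```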

8️⃣ Ingestion Pipeline

Basic Flow

Application / Agent
→ Ingestion Gateway
→ Validation
→ Batching
→ Message Queue
→ Aggregation Workers
→ Time-Series Storage
→ Rollup Storage

Ingestion Gateway Responsibilities

Authentication
Rate limiting
Schema validation
Label validation
Timestamp normalization
Compression handling
Tenant tagging

Aggregation Workers

Responsibilities:

Aggregate samples into time buckets
Compute counters / rates
Compute histograms
Drop invalid labels
Route to storage

👉 Interview Answer

I would design metrics ingestion as a high-throughput pipeline.

Agents batch and send metrics to ingestion gateways.

Gateways validate labels and timestamps, then write samples to a queue.

Aggregation workers process samples, compute rollups, and write them into time-series storage.
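
A gateway's label and timestamp checks might be sketched as follows. The deny list, the label-name pattern, the skew and age limits, and the `timestamp_s` field name are all assumptions made for this example.

```python
import re

LABEL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
DENIED_LABELS = {"user_id", "request_id", "session_id", "order_id", "email", "ip_address"}
MAX_CLOCK_SKEW_S = 300   # reject samples more than 5 minutes in the future
MAX_AGE_S = 3600         # reject samples older than 1 hour


def validate(sample: dict, now_s: float) -> list:
    """Return a list of rejection reasons; an empty list means the sample is accepted."""
    errors = []
    for name in sample.get("labels", {}):
        if not LABEL_NAME.match(name):
            errors.append(f"bad label name: {name}")
        elif name in DENIED_LABELS:
            errors.append(f"high-cardinality label: {name}")
    ts = sample.get("timestamp_s")
    if ts is None:
        sample["timestamp_s"] = now_s   # normalize a missing timestamp to receive time
    elif ts > now_s + MAX_CLOCK_SKEW_S:
        errors.append("timestamp too far in the future")
    elif ts < now_s - MAX_AGE_S:
        errors.append("timestamp too old")
    return errors
```

Rejecting bad samples at the gateway keeps the queue and the aggregation workers from spending capacity on data that storage would refuse anyway.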

9️⃣ Aggregation and Rollups

Why Rollups?

Raw metrics can be too expensive to keep forever.

Example:

1-second resolution for 1 year is very expensive

Rollup Examples

Raw: 1 second resolution → keep 7 days
1-minute rollup → keep 30 days
5-minute rollup → keep 6 months
1-hour rollup → keep 2 years
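
A quick calculation shows why the tiers above are so much cheaper than keeping raw data. It counts points per series only, and assumes 6 months is about 180 days and 2 years about 730 days; real storage also depends on how many aggregates each rollup point stores and on compression.

```python
DAY_S = 86_400

# (resolution in seconds, retention in days)
TIERS = {"raw_1s": (1, 7), "1m": (60, 30), "5m": (300, 180), "1h": (3600, 730)}


def points_per_series(resolution_s: int, retention_days: int) -> int:
    return retention_days * DAY_S // resolution_s


tiered = {name: points_per_series(*tier) for name, tier in TIERS.items()}
print(tiered)                      # {'raw_1s': 604800, '1m': 43200, '5m': 51840, '1h': 17520}
print(sum(tiered.values()))        # 717360
print(points_per_series(1, 730))   # 63072000 if raw 1-second data were kept for 2 years
```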

Common Aggregations

sum
avg
min
max
count
rate
p50 / p95 / p99 (from histogram buckets or sketches, since percentiles cannot be averaged across rollups)

Pre-aggregation

Instead of scanning raw data on every query, common aggregates are precomputed, for example:

request_count per service per minute

👉 Interview Answer

Rollups are essential for cost control.

I would store high-resolution raw metrics for a short period, then downsample them into lower-resolution aggregates for long-term retention.

This keeps dashboards and historical queries efficient without storing raw data forever.
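
Downsampling itself is a small amount of code. The sketch below rolls raw points into fixed buckets; storing sum, count, min, and max per bucket (rather than only an average) is an assumption of this example, chosen so that coarser rollups can be derived from finer ones.

```python
from collections import defaultdict


def downsample(points, bucket_s=60):
    """Roll (timestamp_s, value) points up into fixed-width buckets.

    Each bucket keeps sum, count, min, and max, so averages and coarser
    rollups can be derived later without the raw points.
    """
    buckets = defaultdict(
        lambda: {"sum": 0.0, "count": 0, "min": float("inf"), "max": float("-inf")}
    )
    for ts, value in points:
        b = buckets[ts - ts % bucket_s]   # bucket is keyed by its start time
        b["sum"] += value
        b["count"] += 1
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    return dict(sorted(buckets.items()))


rollup = downsample([(0, 1.0), (30, 3.0), (61, 10.0)])
print(rollup[0]["sum"] / rollup[0]["count"])  # 2.0, the average of the first minute
```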

🔟 Storage Design

Hot Storage

Used for recent high-resolution data.

Requirements:

Fast writes
Fast time-range reads
Efficient compression

Examples:

Purpose-built time-series databases: M3DB, VictoriaMetrics, Prometheus TSDB
Columnar analytical databases: ClickHouse

Cold Storage

Used for long-term rollups.

Examples:

Object storage
Columnar files
Compressed archives

Partitioning

Partition by:

time bucket
tenant
metric name
label hash

Compression

Time-series data compresses well because:

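Timestamps arrive at near-regular intervals, so delta-of-delta encoding is compact
Values often change slowly, so delta or XOR encoding works well
Labels are repeated, so they can be stored once per series instead of once per sample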
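
To show why regularly spaced timestamps compress well, here is a sketch of delta-of-delta encoding, a technique used by several time-series engines. It only produces the small integers; the variable-length bit packing a real engine applies afterwards is left out, and the function names are assumptions.

```python
def delta_of_delta(timestamps):
    """Encode timestamps as [first, first_delta, dod, dod, ...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods


def decode(encoded):
    """Inverse of delta_of_delta."""
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


ts = [1000, 1010, 1020, 1030, 1041, 1051]   # 10 s scrape interval with one late sample
print(delta_of_delta(ts))                   # [1000, 10, 0, 0, 1, -1]
```

Most entries after the first two are zero or close to it, which is what makes the encoded stream cheap to store.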

👉 Interview Answer

I would use a time-series optimized storage engine.

Recent high-resolution metrics go to hot storage, while older rolled-up data can be moved to cheaper storage.

Partitioning by time, tenant, metric name, and label hash helps both ingestion and query performance.
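
One possible way to combine those four dimensions into a placement decision is sketched below. The two-hour time bucket, the shard count, and the function name are assumptions; the point is only that the time bucket selects the partition and the series identity selects the shard.

```python
import hashlib


def partition_for(tenant, metric_name, labels_hash, timestamp_s,
                  num_shards=64, bucket_s=7200):
    """Return (time_bucket_start, shard) for one sample."""
    time_bucket = timestamp_s - timestamp_s % bucket_s
    key = f"{tenant}\x00{metric_name}\x00{labels_hash}".encode("utf-8")
    # hashlib gives the same result in every process, unlike Python's built-in hash().
    shard = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_shards
    return time_bucket, shard
```

All samples of one series land on the same shard, so a range query for that series touches one shard per time bucket.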

1️⃣1️⃣ Query System

Common Queries

QPS by service over last 1 hour
p95 latency by endpoint
error rate by region
CPU usage by host
queue depth over time

Query Flow

User dashboard query
→ Query API
→ Permission check
→ Query planner
→ Select time partitions
→ Fetch raw or rollup data
→ Aggregate / group by
→ Return time series

Query Optimization

Use rollups for long time ranges
Limit group-by cardinality
Cache dashboard queries
Precompute common aggregations
Use downsampling for visualization

👉 Interview Answer

Metrics queries are usually time-range aggregation queries.

The query engine should choose the right resolution: raw data for recent short windows, and rollup data for longer historical ranges.

To keep queries fast, I would limit high-cardinality group-bys and cache common dashboard queries.
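
Resolution selection can be a simple rule in the query planner. The sketch below picks the finest tier that still retains the start of the range and keeps the result under a point budget; the tier table mirrors the rollup example, and the 1,000-point budget is an assumption.

```python
# (resolution_s, retention_s), finest first
TIERS = [(1, 7 * 86_400), (60, 30 * 86_400), (300, 180 * 86_400), (3600, 730 * 86_400)]


def pick_resolution(from_s, to_s, now_s, max_points=1000):
    """Finest tier that still retains from_s and returns at most max_points per series."""
    span = to_s - from_s
    age = now_s - from_s
    for resolution_s, retention_s in TIERS:
        if age <= retention_s and span / resolution_s <= max_points:
            return resolution_s
    return TIERS[-1][0]   # fall back to the coarsest tier


now = 1_000_000_000
print(pick_resolution(now - 600, now, now))          # 1    (last 10 minutes)
print(pick_resolution(now - 3600, now, now))         # 60   (last hour)
print(pick_resolution(now - 86_400, now, now))       # 300  (last day)
print(pick_resolution(now - 7 * 86_400, now, now))   # 3600 (last week)
```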

1️⃣2️⃣ Alerting System

Alert Rule Example

error_rate{service="payment"} > 5% for 5 minutes

Alert Flow

Metric stream / query
→ Rule evaluator
→ Condition matched
→ Dedup / grouping
→ Notification system
→ On-call user

Alerting Design

Evaluate rules periodically
Support thresholds
Support burn-rate alerts
Support missing data detection
Deduplicate repeated alerts
Route alerts by team / service
Respect silence / maintenance windows

👉 Interview Answer

Alerting can be built on top of metrics.

A rule evaluator periodically checks metric conditions, such as error rate or latency thresholds.

To avoid noisy alerts, the system should support deduplication, grouping, silence windows, and routing rules.
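
The "for 5 minutes" clause and deduplication can be captured with a small state machine. The sketch below is one hypothetical shape for a threshold rule; grouping, routing, and silences would sit downstream of the events it emits.

```python
class ThresholdRule:
    """Fires when value > threshold continuously for at least for_s seconds."""

    def __init__(self, threshold, for_s):
        self.threshold = threshold
        self.for_s = for_s
        self.breach_started = None   # time of the first sample in the current breach
        self.firing = False

    def evaluate(self, now_s, value):
        """Return 'fire', 'resolve', or None; an already-firing rule stays silent."""
        if value <= self.threshold:
            self.breach_started = None
            if self.firing:
                self.firing = False
                return "resolve"
            return None
        if self.breach_started is None:
            self.breach_started = now_s
        if not self.firing and now_s - self.breach_started >= self.for_s:
            self.firing = True
            return "fire"
        return None


rule = ThresholdRule(threshold=0.05, for_s=300)   # error_rate > 5% for 5 minutes
readings = [(0, 0.08), (60, 0.09), (300, 0.07), (360, 0.10), (420, 0.01)]
print([rule.evaluate(t, v) for t, v in readings])
# [None, None, 'fire', None, 'resolve']
```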

1️⃣3️⃣ Cardinality Control

High-cardinality Labels

Dangerous labels:

user_id
request_id
session_id
order_id
email
ip_address

Why Dangerous?

They create too many unique time series.

This causes:

High memory usage
High storage cost
Slow queries
Ingestion overload

Strategies

Reject disallowed labels
Limit label value count
Cardinality quotas per tenant/service
Drop or hash dangerous labels
Use exemplars/traces for request-level detail
Move high-cardinality details to logs/traces

👉 Interview Answer

Cardinality control is one of the hardest parts of metrics systems.

I would prevent high-cardinality labels like user ID or request ID from being used as metric labels.

Request-level detail should usually go to logs or traces, while metrics should stay aggregated and low-cardinality.
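
A per-metric series quota can be enforced at ingestion with very little state. The sketch below tracks exact label sets in memory, which is an assumption made for clarity; at scale an approximate counter would likely replace the sets.

```python
from collections import defaultdict


class CardinalityLimiter:
    """Caps the number of distinct series admitted per metric name."""

    def __init__(self, max_series_per_metric):
        self.max_series = max_series_per_metric
        self.seen = defaultdict(set)   # metric name -> set of label-set keys

    def admit(self, metric_name, labels):
        key = tuple(sorted(labels.items()))
        known = self.seen[metric_name]
        if key in known:
            return True                # existing series are always accepted
        if len(known) >= self.max_series:
            return False               # a new series would exceed the quota
        known.add(key)
        return True
```

Existing series keep flowing after the quota is reached, so a cardinality spike from one bad label degrades only the new series rather than the whole metric.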

1️⃣4️⃣ Retention and Cost Control

Retention Policy Example

Raw 1-second metrics: 7 days
1-minute rollups: 30 days
5-minute rollups: 6 months
1-hour rollups: 2 years

Cost Control Strategies

Rollups and downsampling
Compression
Cardinality limits
Per-tenant quotas
Drop unused metrics
Sampling for expensive histograms
Cold storage for historical data

👉 Interview Answer

Metrics systems can become expensive because every label combination creates a new time series.

I would control cost using rollups, compression, label cardinality limits, per-tenant quotas, and different retention policies by resolution.

1️⃣5️⃣ Reliability and Failure Handling

Common Failures

Agent cannot send metrics
Ingestion gateway overloaded
Queue backlog
Storage write failure
Query timeout
High-cardinality explosion
Alert evaluator failure

Strategies

Local agent buffering
Batching and compression
Durable queue
Rate limit bad producers
Drop low-priority metrics under pressure
Use DLQ for invalid samples
Fallback to rollups for queries
Monitor ingestion lag

👉 Interview Answer

Metrics systems should handle overload gracefully.

Agents can buffer locally, ingestion gateways can rate limit bad producers, and queues can absorb temporary spikes.

If the system is overloaded, it is usually better to drop low-priority metrics than to impact application traffic.
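
Dropping low-priority metrics first can be implemented with a bounded priority buffer. In the sketch below, the class name and the integer priority scale (higher means more important) are assumptions of the example.

```python
import heapq
import itertools


class PriorityBuffer:
    """Bounded buffer that evicts the lowest-priority sample when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                  # min-heap of (priority, seq, sample)
        self.seq = itertools.count()    # tie-breaker: older samples are evicted first
        self.dropped = 0

    def offer(self, sample, priority):
        """Return True if the sample was buffered, False if it was dropped."""
        entry = (priority, next(self.seq), sample)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
            return True
        self.dropped += 1               # either the new or an old sample is lost
        if priority <= self.heap[0][0]:
            return False                # the incoming sample is the least important
        heapq.heapreplace(self.heap, entry)   # evict the current lowest-priority sample
        return True
```

The `dropped` counter matters as much as the eviction itself: it is the signal that tells operators the pipeline is shedding load.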

1️⃣6️⃣ Consistency Model

Stronger Consistency Needed For

Alert rule configuration
Tenant quotas
Access control
Billing-related metrics

Eventual Consistency Acceptable For

Dashboard data
Aggregated metrics
Rollups
Ingestion visibility
Historical query results

👉 Interview Answer

Metrics systems usually accept eventual consistency.

It is acceptable if dashboard data is delayed by a few seconds.

However, alert configuration, access control, tenant quotas, and billing-related metrics need stronger correctness.

1️⃣7️⃣ Security and Access Control

Requirements

Tenant isolation
RBAC for dashboards
Restrict access to production metrics
Audit access to sensitive metrics
Encrypt data in transit and at rest
Prevent labels from containing PII

👉 Interview Answer

Metrics may contain sensitive operational information, so access control is important.

I would enforce tenant isolation, role-based access control, encryption, and label validation to prevent PII from entering metric labels.

1️⃣8️⃣ Observability of the Metrics System

Key Metrics

Ingestion QPS
Samples per second
Queue lag
Dropped samples
Active time-series count
Cardinality growth
Storage write latency
Query latency
Alert evaluation latency
Cost per tenant/service

👉 Interview Answer

The metrics system itself must be observable.

I would monitor ingestion throughput, queue lag, dropped samples, active time-series count, cardinality growth, query latency, and alert evaluation delay.

1️⃣9️⃣ End-to-End Flow

Ingestion Flow

Application emits metric
→ Agent batches metrics
→ Ingestion gateway validates labels
→ Queue buffers samples
→ Aggregation workers compute rollups
→ Time-series storage writes samples

Query Flow

User opens dashboard
→ Query service checks permission
→ Select raw or rollup data
→ Execute time-range aggregation
→ Return time-series result

Alert Flow

Rule evaluator checks metric
→ Condition true for duration
→ Deduplicate alert
→ Notification system sends alert
→ On-call user notified

Key Insight

A metrics system is an aggregation-first time-series system, not a general-purpose event search system.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a metrics system, I think of it as a high-throughput time-series data platform for monitoring, dashboards, and alerting.

Applications emit numerical measurements such as counters, gauges, histograms, and timers.

These metrics are collected by agents or scrapers, batched, validated, and sent through an ingestion pipeline.

A time series is identified by metric name plus labels. Cardinality is a major design concern, because each unique label combination creates a new time series.

I would prevent high-cardinality labels such as user ID, request ID, or order ID from being used as metric labels.

For ingestion, I would use gateways, queues, aggregation workers, and time-series storage.

Recent high-resolution data goes to hot storage, while older data is downsampled into rollups and stored more cheaply.

Query serving should choose the right resolution: raw data for recent short windows, and rollups for long historical windows.

Alerting is built on top of metrics, with rule evaluation, deduplication, grouping, silence windows, and routing.

The main trade-offs are ingestion throughput, query latency, storage cost, retention, cardinality, and consistency.

Metrics can usually be eventually consistent, but alert rules, access control, tenant quotas, and billing-related metrics need stronger correctness.

Ultimately, the goal is to provide reliable, cost-efficient, low-latency visibility into system health and performance at scale.

⭐ Final Insight

The core of a metrics system is not storing every individual event, but efficiently storing, aggregating, and querying time-series data at scale.
