🎯 Design Metrics System
1️⃣ Core Framework
When discussing Metrics System design, I frame it as:
- Metrics generation and collection
- Ingestion pipeline
- Time-series data model
- Aggregation and rollup
- Storage and retention
- Query and dashboard serving
- Alerting and anomaly detection
- Cardinality, cost, and reliability trade-offs
2️⃣ Core Requirements
Functional Requirements
- Applications can emit metrics
- Support counters, gauges, histograms, and timers
- Support tags / labels
- Support time-series queries
- Support dashboards
- Support alerting rules
- Support aggregation and rollups
- Support retention policies
Non-functional Requirements
- High write throughput
- Low ingestion latency
- Efficient time-series storage
- Fast range queries
- High availability
- Cost-efficient retention
- Eventually consistent metrics are acceptable
👉 Interview Answer
A metrics system collects numerical time-series data from services, stores it efficiently, supports aggregation queries, powers dashboards, and triggers alerts.
The main challenge is handling high write throughput, high-cardinality labels, fast time-range queries, and long-term storage cost.
3️⃣ Metrics vs Logs
Logs
Logs are event records.
Example:
{
"level": "ERROR",
"message": "payment failed",
"traceId": "abc"
}
Metrics
Metrics are numerical measurements over time.
Example:
request_count{service="payment", status="500"} = 1
latency_ms{service="payment"} = 123
cpu_usage{host="h1"} = 72%
Key Difference
| System | Data Type | Query Pattern |
|---|---|---|
| Logs | Detailed events | Search / debug |
| Metrics | Numeric time series | Aggregate / monitor |
👉 Interview Answer
Logs are detailed event records, while metrics are numerical time-series measurements.
Metrics are optimized for aggregation, dashboards, and alerting, while logs are optimized for search and debugging.
4️⃣ Metric Types
Counter
Monotonically increasing value.
Examples:
request_count
error_count
order_created_count
Used for:
- QPS
- Error rate
- Throughput
Gauge
Value can go up or down.
Examples:
cpu_usage
memory_usage
queue_depth
active_connections
Histogram
Tracks distribution.
Examples:
request_latency
response_size
db_query_duration
Used for:
- p50 / p95 / p99 latency
- Distribution analysis
Timer
Measures duration.
Examples:
api_latency_ms
job_runtime_seconds
👉 Interview Answer
I would support common metric types: counters for increasing values, gauges for point-in-time values, and histograms or timers for latency distributions.
Histograms are especially important for calculating p95 and p99 latency.
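A minimal Python sketch may make the difference between the types concrete. The class names, the fixed bucket bounds, and the "upper bound of the containing bucket" quantile estimate are illustrative assumptions, not any particular library's API; real systems often interpolate within buckets or use sketches instead.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing value; negative increments are rejected."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Fixed-bucket histogram; each bound is an inclusive upper limit."""
    bounds: list
    counts: list = field(init=False)

    def __post_init__(self) -> None:
        self.bounds = sorted(self.bounds)
        # One extra bucket catches observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile(self, q: float) -> float:
        """Estimate a quantile as the upper bound of the bucket that contains it."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations")
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= q * total:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")
```

Because only bucket counts are stored, the estimate is only as precise as the bucket layout, which is why bucket boundaries are usually chosen around the latency targets that matter.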
5️⃣ Main APIs / Interfaces
Push Metric
POST /api/metrics
Request:
{
"name": "request_count",
"value": 1,
"timestamp": "2026-05-02T10:00:00Z",
"type": "counter",
"labels": {
"service": "payment-service",
"endpoint": "/checkout",
"status": "500",
"region": "us-east-1"
}
}
Query Metrics
GET /api/metrics/query?metric=request_count&from=...&to=...
Dashboard Query
POST /api/metrics/query
Request:
{
"metric": "request_latency",
"aggregation": "p95",
"groupBy": ["service", "endpoint"],
"from": "2026-05-02T09:00:00Z",
"to": "2026-05-02T10:00:00Z"
}
👉 Interview Answer
Metrics can be pushed directly to the system, but in production, metrics are usually collected by agents or sidecars.
The query API should support time ranges, aggregation functions, filters, and group-by dimensions.
6️⃣ Collection Model
Option 1: Push Model
Application → Metrics System
Pros
- No scrape endpoint or service discovery needed
- Works well for short-lived jobs
Cons
- Clients must handle batching, retries, and backoff
- Harder to control ingestion load
Option 2: Pull Model
Metrics Collector → Scrape Application Endpoint
Example:
GET /metrics
Pros
- Centralized control over scrape interval and load
- A failed scrape reveals an unhealthy target
- Common in Prometheus-style systems
Cons
- Requires service discovery to find targets
- Harder for short-lived jobs
Option 3: Agent / Sidecar Model
Application → Local Agent → Metrics Pipeline
Pros
- Local buffering
- Batching
- Reduces app complexity
- Good for high scale
👉 Interview Answer
Metrics can be collected through push, pull, or agent-based models.
Pull-based collection gives centralized control, while push-based collection works better for short-lived jobs.
At large scale, using local agents or sidecars can provide batching, buffering, and better isolation from application latency.
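The batching and buffering behavior of a local agent can be sketched as follows. This is a simplified illustration under stated assumptions: `LocalAgent`, the `send` callback (which returns `True` on success), and the drop-oldest policy when the buffer is full are all hypothetical choices, not a specific agent's behavior.

```python
import itertools
from collections import deque
from typing import Callable


class LocalAgent:
    """Buffers samples locally and ships them in batches (hypothetical sketch)."""

    def __init__(self, send: Callable[[list], bool], batch_size: int = 100,
                 max_buffer: int = 10_000) -> None:
        self._send = send
        self._batch_size = batch_size
        # A bounded deque silently drops the oldest sample when full.
        self._buffer = deque(maxlen=max_buffer)

    def record(self, sample) -> None:
        self._buffer.append(sample)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> int:
        """Send buffered samples batch by batch; stop at the first failure.

        Returns the number of samples delivered. Undelivered samples stay
        buffered so a later flush can retry them.
        """
        sent = 0
        while self._buffer:
            batch = list(itertools.islice(self._buffer, self._batch_size))
            if not self._send(batch):
                break
            for _ in batch:
                self._buffer.popleft()
            sent += len(batch)
        return sent
```

The application only ever calls `record`, so a slow or unavailable backend costs it a bounded amount of memory rather than request latency.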
7️⃣ Time-Series Data Model
Time Series Definition
A time series is identified by:
metric_name + label set
Example:
request_count{
service="payment",
endpoint="/checkout",
status="500"
}
Each data point:
(timestamp, value)
Sample Row
metric_sample (
metric_name VARCHAR,
labels_hash VARCHAR,
timestamp TIMESTAMP,
value DOUBLE,
labels JSON,
PRIMARY KEY (metric_name, labels_hash, timestamp)
)
Important Concept: Cardinality
Cardinality = number of unique time series.
Example:
request_count{service, endpoint, status}
If:
100 services × 100 endpoints × 5 status codes
= 50,000 time series
👉 Interview Answer
A metric time series is identified by metric name and label set.
Cardinality is one of the most important design concerns.
If labels include high-cardinality fields such as user ID, request ID, or order ID, the number of time series can explode and make the system extremely expensive.
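The identity rule and the cardinality arithmetic above can be sketched in a few lines. The function names, the canonical string format, and the truncated SHA-256 used for `labels_hash` are assumptions made for illustration.

```python
import hashlib
from math import prod


def series_key(metric_name: str, labels: dict) -> str:
    """Derive a stable identifier from the metric name plus its label set."""
    # Sort labels so that insertion order does not change the identity.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    canonical = f"{metric_name}{{{pairs}}}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def worst_case_cardinality(distinct_values_per_label: dict) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    return prod(distinct_values_per_label.values())
```

With `{"service": 100, "endpoint": 100, "status": 5}` the bound is the 50,000 series from the example above; adding a `user_id` label with one million distinct values multiplies that bound by one million.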
8️⃣ Ingestion Pipeline
Basic Flow
Application / Agent
→ Ingestion Gateway
→ Validation
→ Batching
→ Message Queue
→ Aggregation Workers
→ Time-Series Storage
→ Rollup Storage
Ingestion Gateway Responsibilities
- Authentication
- Rate limiting
- Schema validation
- Label validation
- Timestamp normalization
- Compression handling
- Tenant tagging
Aggregation Workers
Responsibilities:
- Aggregate samples into time buckets
- Compute counters / rates
- Compute histograms
- Drop invalid labels
- Route to storage
👉 Interview Answer
I would design metrics ingestion as a high-throughput pipeline.
Agents batch and send metrics to ingestion gateways.
Gateways validate labels and timestamps, then write samples to a queue.
Aggregation workers process samples, compute rollups, and write them into time-series storage.
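As a rough sketch of the validation and bucketing steps, the following assumes samples have already been normalized by the gateway into dicts with hypothetical `name`, `value`, `ts` (Unix seconds), and `labels` keys. The label-name pattern and the `MAX_LABELS` limit are illustrative assumptions.

```python
import re
from collections import defaultdict

LABEL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
MAX_LABELS = 10  # assumed per-sample limit


def validate(sample: dict) -> bool:
    """Gateway-style check of field types, label count, and label names."""
    labels = sample.get("labels", {})
    return (
        isinstance(sample.get("name"), str)
        and isinstance(sample.get("value"), (int, float))
        and isinstance(sample.get("ts"), (int, float))
        and isinstance(labels, dict)
        and len(labels) <= MAX_LABELS
        and all(isinstance(k, str) and LABEL_NAME.match(k) is not None for k in labels)
    )


def aggregate(samples, bucket_seconds: int = 60):
    """Sum valid samples per (name, label set, bucket start).

    Returns (buckets, rejected_count). Summing suits counter increments;
    gauges and histograms would need different merge logic.
    """
    buckets = defaultdict(float)
    rejected = 0
    for s in samples:
        if not validate(s):
            rejected += 1
            continue
        bucket_start = int(s["ts"]) // bucket_seconds * bucket_seconds
        key = (s["name"], tuple(sorted(s.get("labels", {}).items())), bucket_start)
        buckets[key] += s["value"]
    return dict(buckets), rejected
```

In a real pipeline the rejected samples would typically be counted per producer so that misbehaving clients can be identified and rate limited.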
9️⃣ Aggregation and Rollups
Why Rollups?
Raw metrics can be too expensive to keep forever.
Example:
1-second resolution for 1 year is very expensive
Rollup Examples
Raw: 1 second resolution → keep 7 days
1-minute rollup → keep 30 days
5-minute rollup → keep 6 months
1-hour rollup → keep 2 years
Common Aggregations
- sum
- avg
- min
- max
- count
- rate
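- p50 / p95 / p99 (computed from histogram buckets or sketches, because percentiles cannot be averaged across rollups)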
Pre-aggregation
Instead of querying raw data every time:
request_count per service per minute
is precomputed.
👉 Interview Answer
Rollups are essential for cost control.
I would store high-resolution raw metrics for a short period, then downsample them into lower-resolution aggregates for long-term retention.
This keeps dashboards and historical queries efficient without storing raw data forever.
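A downsampling step might look like the sketch below. The `Rollup` shape is an assumption; the point it illustrates is that storing count, sum, min, and max per bucket (rather than only an average) lets coarser rollups be derived from finer ones without returning to raw data.

```python
from dataclasses import dataclass


@dataclass
class Rollup:
    bucket_start: int
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def avg(self) -> float:
        return self.total / self.count


def downsample(points, bucket_seconds: int):
    """Collapse (unix_ts, value) points into one Rollup per time bucket."""
    out = {}
    for ts, value in points:
        start = ts // bucket_seconds * bucket_seconds
        rollup = out.get(start)
        if rollup is None:
            out[start] = Rollup(start, 1, value, value, value)
        else:
            rollup.count += 1
            rollup.total += value
            rollup.minimum = min(rollup.minimum, value)
            rollup.maximum = max(rollup.maximum, value)
    return [out[k] for k in sorted(out)]
```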
🔟 Storage Design
Hot Storage
Used for recent high-resolution data.
Requirements:
- Fast writes
- Fast time-range reads
- Efficient compression
Examples:
Time-series DB
ClickHouse
M3DB
VictoriaMetrics
Prometheus TSDB
Cold Storage
Used for long-term rollups.
Examples:
Object storage
Columnar files
Compressed archives
Partitioning
Partition by:
time bucket
tenant
metric name
label hash
Compression
Time-series data compresses well because:
- Timestamps are sequential
- Values often change slowly
- Labels are repeated
👉 Interview Answer
I would use a time-series optimized storage engine.
Recent high-resolution metrics go to hot storage, while older rolled-up data can be moved to cheaper storage.
Partitioning by time, tenant, metric name, and label hash helps both ingestion and query performance.
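To illustrate why sequential timestamps compress well, here is a small sketch of delta-of-delta encoding, a technique used in Gorilla-style time-series compression. It shows only the integer transform; a real engine would follow it with variable-length bit packing, which is omitted here.

```python
def delta_of_delta_encode(timestamps):
    """Encode as [first_ts, first_delta, delta_of_delta, ...]."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out
```

For samples scraped at a fixed interval, almost every encoded value after the first two is zero, so each timestamp can be stored in very few bits.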
1️⃣1️⃣ Query System
Common Queries
QPS by service over last 1 hour
p95 latency by endpoint
error rate by region
CPU usage by host
queue depth over time
Query Flow
User dashboard query
→ Query API
→ Permission check
→ Query planner
→ Select time partitions
→ Fetch raw or rollup data
→ Aggregate / group by
→ Return time series
Query Optimization
- Use rollups for long time ranges
- Limit group-by cardinality
- Cache dashboard queries
- Precompute common aggregations
- Use downsampling for visualization
👉 Interview Answer
Metrics queries are usually time-range aggregation queries.
The query engine should choose the right resolution: raw data for recent short windows, and rollup data for longer historical ranges.
To keep queries fast, I would limit high-cardinality group-bys and cache common dashboard queries.
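Resolution selection can be sketched as a small planner. The tiers mirror the rollup examples earlier in these notes; treating 6 months as 180 days and 2 years as 730 days, the `max_points` budget, and the names `Tier` and `choose_tier` are assumptions for illustration.

```python
from dataclasses import dataclass

DAY = 86_400


@dataclass(frozen=True)
class Tier:
    name: str
    step_seconds: int
    retention_seconds: int


# Ordered from finest to coarsest resolution.
TIERS = [
    Tier("raw", 1, 7 * DAY),
    Tier("1m", 60, 30 * DAY),
    Tier("5m", 300, 180 * DAY),
    Tier("1h", 3600, 730 * DAY),
]


def choose_tier(start: int, end: int, now: int, max_points: int = 2000) -> Tier:
    """Pick the finest tier that still holds the data and fits the point budget."""
    age = now - start
    span = end - start
    for tier in TIERS:
        if age <= tier.retention_seconds and span / tier.step_seconds <= max_points:
            return tier
    # Nothing fits the budget: fall back to the coarsest tier that has the data.
    for tier in reversed(TIERS):
        if age <= tier.retention_seconds:
            return tier
    raise ValueError("query start is outside every retention window")
```

The point budget is what makes a one-hour dashboard panel read 1-minute rollups instead of 3,600 raw points per series.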
1️⃣2️⃣ Alerting System
Alert Rule Example
error_rate{service="payment"} > 5% for 5 minutes
Alert Flow
Metric stream / query
→ Rule evaluator
→ Condition matched
→ Dedup / grouping
→ Notification system
→ On-call user
Alerting Design
- Evaluate rules periodically
- Support thresholds
- Support burn-rate alerts
- Support missing data detection
- Deduplicate repeated alerts
- Route alerts by team / service
- Respect silence / maintenance windows
👉 Interview Answer
Alerting can be built on top of metrics.
A rule evaluator periodically checks metric conditions, such as error rate or latency thresholds.
To avoid noisy alerts, the system should support deduplication, grouping, silence windows, and routing rules.
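The "condition true for a duration" behavior, plus basic deduplication, can be sketched as a small state machine. `ThresholdRule` and its return values are hypothetical; grouping, routing, and silences are left out.

```python
class ThresholdRule:
    """Fires once after the value stays above the threshold for `for_seconds`."""

    def __init__(self, threshold: float, for_seconds: int) -> None:
        self.threshold = threshold
        self.for_seconds = for_seconds
        self._breach_since = None
        self._firing = False

    def evaluate(self, now: int, value: float):
        """Return "fire", "resolve", or None for one periodic evaluation."""
        if value <= self.threshold:
            # Any healthy evaluation resets the pending window.
            self._breach_since = None
            if self._firing:
                self._firing = False
                return "resolve"
            return None
        if self._breach_since is None:
            self._breach_since = now
        if not self._firing and now - self._breach_since >= self.for_seconds:
            self._firing = True
            return "fire"
        # Still pending, or already firing: emit nothing (deduplication).
        return None
```

For the example rule above, `ThresholdRule(0.05, 300)` would be evaluated on each tick with the current payment error rate; a brief dip below 5% restarts the five-minute window.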
1️⃣3️⃣ Cardinality Control
High-cardinality Labels
Dangerous labels:
user_id
request_id
session_id
order_id
email
ip_address
Why Dangerous?
They create too many unique time series.
This causes:
- High memory usage
- High storage cost
- Slow queries
- Ingestion overload
Strategies
- Reject disallowed labels
- Limit label value count
- Cardinality quotas per tenant/service
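- Drop dangerous labels, or bucket their values into a small fixed set (plain hashing hides the value but does not reduce cardinality)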
- Use exemplars/traces for request-level detail
- Move high-cardinality details to logs/traces
👉 Interview Answer
Cardinality control is one of the hardest parts of metrics systems.
I would prevent high-cardinality labels like user ID or request ID from being used as metric labels.
Request-level detail should usually go to logs or traces, while metrics should stay aggregated and low-cardinality.
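Two of these strategies, dropping disallowed labels and enforcing a per-metric series quota, can be sketched together. The deny list reuses the dangerous labels listed above; the `CardinalityLimiter` class and its in-memory set of seen series are assumptions for illustration, and a production limiter would more likely use an approximate counter and per-tenant limits.

```python
DENYLIST = {"user_id", "request_id", "session_id", "order_id", "email", "ip_address"}


class CardinalityLimiter:
    """Strips deny-listed labels and caps the number of series per metric."""

    def __init__(self, max_series_per_metric: int) -> None:
        self.max_series = max_series_per_metric
        self._seen = {}  # metric name -> set of label tuples already admitted

    def admit(self, name: str, labels: dict):
        """Return (accepted, cleaned_labels) for one incoming sample."""
        cleaned = {k: v for k, v in labels.items() if k not in DENYLIST}
        key = tuple(sorted(cleaned.items()))
        series = self._seen.setdefault(name, set())
        if key not in series:
            if len(series) >= self.max_series:
                return False, cleaned  # a new series would exceed the quota
            series.add(key)
        return True, cleaned
```

Existing series keep flowing once the quota is reached; only new label combinations are rejected, which limits the blast radius of a bad deploy.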
1️⃣4️⃣ Retention and Cost Control
Retention Policy Example
Raw 10-second metrics: 7 days
1-minute rollups: 30 days
5-minute rollups: 6 months
1-hour rollups: 2 years
Cost Control Strategies
- Rollups and downsampling
- Compression
- Cardinality limits
- Per-tenant quotas
- Drop unused metrics
- Sampling for expensive histograms
- Cold storage for historical data
👉 Interview Answer
Metrics systems can become expensive because every label combination creates a new time series.
I would control cost using rollups, compression, label cardinality limits, per-tenant quotas, and different retention policies by resolution.
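A back-of-the-envelope calculation shows how much tiered retention saves. This sketch uses the retention policy example above and assumes 6 months is 180 days, 2 years is 730 days, and an uncompressed point costs 16 bytes (8-byte timestamp plus 8-byte value); all three figures are assumptions.

```python
DAY = 86_400
BYTES_PER_POINT = 16  # assumed: uncompressed 8-byte timestamp + 8-byte float

# (step_seconds, retention_days), following the retention policy example.
TIERED = [(10, 7), (60, 30), (300, 180), (3600, 730)]
RAW_ONLY = [(10, 730)]


def points_per_series(tiers) -> int:
    """Total stored points for one series across all retention tiers."""
    return sum(days * DAY // step for step, days in tiers)


def estimated_bytes(tiers, series_count: int) -> int:
    """Uncompressed storage estimate for a given number of series."""
    return points_per_series(tiers) * series_count * BYTES_PER_POINT


def savings_ratio() -> float:
    """How many times fewer points tiered retention stores than raw-only."""
    return points_per_series(RAW_ONLY) / points_per_series(TIERED)
```

Under these assumptions one series stores 173,040 points with tiers versus 6,307,200 points if 10-second data were kept for two years, roughly a 36x reduction before compression.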
1️⃣5️⃣ Reliability and Failure Handling
Common Failures
- Agent cannot send metrics
- Ingestion gateway overloaded
- Queue backlog
- Storage write failure
- Query timeout
- High-cardinality explosion
- Alert evaluator failure
Strategies
- Local agent buffering
- Batching and compression
- Durable queue
- Rate limit bad producers
- Drop low-priority metrics under pressure
- Use DLQ for invalid samples
- Fall back to rollups for queries
- Monitor ingestion lag
👉 Interview Answer
Metrics systems should handle overload gracefully.
Agents can buffer locally, ingestion gateways can rate limit bad producers, and queues can absorb temporary spikes.
If the system is overloaded, it is usually better to drop low-priority metrics than to impact application traffic.
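Dropping low-priority metrics under pressure can be sketched as a bounded queue that evicts its least important entry when full. `SheddingQueue` and the integer priority scheme (higher means more important) are hypothetical.

```python
import heapq
import itertools


class SheddingQueue:
    """Bounded buffer that sheds the lowest-priority sample when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap = []  # min-heap of (priority, arrival_seq, sample)
        self._seq = itertools.count()
        self.dropped = 0

    def offer(self, sample, priority: int) -> bool:
        """Try to enqueue; returns False if the incoming sample was shed."""
        entry = (priority, next(self._seq), sample)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        self.dropped += 1
        if priority <= self._heap[0][0]:
            return False  # incoming sample is no more important than the least important one queued
        heapq.heapreplace(self._heap, entry)  # evict the least important queued sample
        return True

    def drain(self) -> list:
        """Return the surviving samples in arrival order and empty the queue."""
        items = [sample for _, _, sample in sorted(self._heap, key=lambda e: e[1])]
        self._heap.clear()
        return items
```

The `dropped` counter matters as much as the shedding itself, since dropped samples are one of the signals the metrics system should report about its own health.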
1️⃣6️⃣ Consistency Model
Stronger Consistency Needed For
- Alert rule configuration
- Tenant quotas
- Access control
- Billing-related metrics
Eventual Consistency Acceptable For
- Dashboard data
- Aggregated metrics
- Rollups
- Ingestion visibility
- Historical query results
👉 Interview Answer
Metrics systems usually accept eventual consistency.
It is acceptable if dashboard data is delayed by a few seconds.
However, alert configuration, access control, tenant quotas, and billing-related metrics need stronger correctness.
1️⃣7️⃣ Security and Access Control
Requirements
- Tenant isolation
- RBAC for dashboards
- Restrict access to production metrics
- Audit access to sensitive metrics
- Encrypt data in transit and at rest
- Prevent labels from containing PII
👉 Interview Answer
Metrics may contain sensitive operational information, so access control is important.
I would enforce tenant isolation, role-based access control, encryption, and label validation to prevent PII from entering metric labels.
1️⃣8️⃣ Observability of the Metrics System
Key Metrics
- Ingestion QPS
- Samples per second
- Queue lag
- Dropped samples
- Active time-series count
- Cardinality growth
- Storage write latency
- Query latency
- Alert evaluation latency
- Cost per tenant/service
👉 Interview Answer
The metrics system itself must be observable.
I would monitor ingestion throughput, queue lag, dropped samples, active time-series count, cardinality growth, query latency, and alert evaluation delay.
1️⃣9️⃣ End-to-End Flow
Ingestion Flow
Application emits metric
→ Agent batches metrics
→ Ingestion gateway validates labels
→ Queue buffers samples
→ Aggregation workers compute rollups
→ Time-series storage writes samples
Query Flow
User opens dashboard
→ Query service checks permission
→ Select raw or rollup data
→ Execute time-range aggregation
→ Return time-series result
Alert Flow
Rule evaluator checks metric
→ Condition true for duration
→ Deduplicate alert
→ Notification system sends alert
→ On-call user notified
Key Insight
A metrics system is an aggregation-first time-series system, not a general-purpose event search system.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a metrics system, I think of it as a high-throughput time-series data platform for monitoring, dashboards, and alerting.
Applications emit numerical measurements such as counters, gauges, histograms, and timers.
These metrics are collected by agents or scrapers, batched, validated, and sent through an ingestion pipeline.
A time series is identified by metric name plus labels. Cardinality is a major design concern, because each unique label combination creates a new time series.
I would prevent high-cardinality labels such as user ID, request ID, or order ID from being used as metric labels.
For ingestion, I would use gateways, queues, aggregation workers, and time-series storage.
Recent high-resolution data goes to hot storage, while older data is downsampled into rollups and stored more cheaply.
Query serving should choose the right resolution: raw data for recent short windows, and rollups for long historical windows.
Alerting is built on top of metrics, with rule evaluation, deduplication, grouping, silence windows, and routing.
The main trade-offs are ingestion throughput, query latency, storage cost, retention, cardinality, and consistency.
Metrics can usually be eventually consistent, but alert rules, access control, tenant quotas, and billing-related metrics need stronger correctness.
Ultimately, the goal is to provide reliable, cost-efficient, low-latency visibility into system health and performance at scale.
⭐ Final Insight
The core of a metrics system is not storing every individual event, but efficiently storing, aggregating, and querying time-series data at scale.