d&d-t System Design Deep Dive

🎯 Design Logging System

1️⃣ Core Framework

When discussing Logging System design, I frame it as:

Log generation and collection
Log ingestion pipeline
Buffering, queueing, and backpressure
Log storage and indexing
Search and query serving
Retention, sampling, and cost control
Reliability, ordering, and failure handling
Security, access control, and observability

2️⃣ Core Requirements

Functional Requirements

Applications can emit logs
Support structured and unstructured logs
Support log ingestion from many services
Support log search by time range, service, level, trace ID
Support real-time log streaming
Support retention policies
Support alerts based on log patterns
Support access control

Non-functional Requirements

High write throughput
Low ingestion latency
Durable log storage
Scalable query performance
High availability
Cost-efficient long-term retention
Eventually consistent search is acceptable

👉 Interview Answer

A logging system collects logs from many applications, ingests them through a scalable pipeline, stores them durably, indexes them for search, and supports querying, alerting, and retention.

The main challenge is handling very high write throughput while keeping search fast and storage cost under control.

3️⃣ Main APIs / Interfaces

Emit Log

Application code usually does not send logs directly to the central service.

Instead, logs are written locally first and picked up by an agent:

Application → stdout / log file → agent

Log Ingestion API

POST /api/logs

Request:

{
  "service": "payment-service",
  "environment": "prod",
  "level": "ERROR",
  "message": "payment failed",
  "timestamp": "2026-05-02T10:00:00Z",
  "traceId": "trace-123",
  "metadata": {
    "orderId": "o789",
    "region": "us-east-1"
  }
}

Search Logs

GET /api/logs/search?service=payment-service&level=ERROR&from=...&to=...

Stream Logs

GET /api/logs/stream?service=payment-service

👉 Interview Answer

Applications usually write logs to stdout or local files.

A log agent collects those logs and sends them to the central ingestion pipeline.

The system should also expose search and streaming APIs for debugging, monitoring, and incident response.

4️⃣ Data Model

Log Event

{
  "logId": "log-123",
  "service": "payment-service",
  "environment": "prod",
  "level": "ERROR",
  "timestamp": "2026-05-02T10:00:00Z",
  "message": "payment failed",
  "traceId": "trace-123",
  "host": "host-1",
  "region": "us-east-1",
  "metadata": {
    "orderId": "o789"
  }
}

Common Indexed Fields

timestamp
service
environment
log level
trace ID
host
region
request ID
user-defined tags

Raw Log vs Indexed Log

Raw log = complete original event
Indexed log = selected fields optimized for search

👉 Interview Answer

I would store logs as structured events whenever possible.

Common fields like timestamp, service, level, environment, trace ID, and host should be indexed.

The raw log should still be preserved, while selected fields are indexed for fast querying.
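
As a rough illustration of the raw-versus-indexed split, the sketch below projects a log event onto a small allowlist of indexed fields while keeping the full event as the raw record. The `INDEXED_FIELDS` set and the `split_event` helper are assumptions made for this example, not part of any specific product.

```python
import json

# Hypothetical allowlist of indexed fields, taken from the list above.
INDEXED_FIELDS = {"timestamp", "service", "environment", "level",
                  "traceId", "host", "region"}


def split_event(event: dict) -> tuple[str, dict]:
    """Return (raw_json, indexed_doc) for one log event."""
    raw = json.dumps(event, sort_keys=True)   # complete original event
    indexed = {k: v for k, v in event.items() if k in INDEXED_FIELDS}
    indexed["logId"] = event["logId"]         # pointer back to the raw record
    return raw, indexed


event = {
    "logId": "log-123",
    "service": "payment-service",
    "environment": "prod",
    "level": "ERROR",
    "timestamp": "2026-05-02T10:00:00Z",
    "message": "payment failed",
    "traceId": "trace-123",
    "host": "host-1",
    "region": "us-east-1",
    "metadata": {"orderId": "o789"},
}
raw, indexed = split_event(event)
```

The free-text `message` and the nested `metadata` stay only in the raw record here; whether to index them (for example, full-text on `message`) is a separate cost decision.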

5️⃣ Log Collection

Option 1: Application Pushes Logs Directly

Application → Logging Service

Pros

Simple conceptually
Direct control from application

Cons

Logging failure can affect application
Adds latency to request path
Harder to buffer locally

Option 2: Local Agent Collection

Application → stdout / file → Log Agent → Pipeline

Pros

Decouples app from logging system
Supports local buffering
Standard production pattern
Works well with containers and Kubernetes

Cons

Agent needs deployment and monitoring
Agent configuration complexity

👉 Interview Answer

I would avoid having applications synchronously send logs to the central logging service.

Instead, applications should write logs to stdout or local files, and a local agent collects, buffers, enriches, and forwards them asynchronously.

This prevents logging failures from impacting application latency.
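
A minimal sketch of the agent's buffer-and-forward step, assuming a hypothetical `send` callable that ships one batch to the pipeline. The class name, batch size, and wait time are illustrative choices; a real agent would also persist the buffer to local disk and flush on a timer rather than only when a new line arrives.

```python
import time


class BatchingAgent:
    """Buffers log lines and forwards them in batches (illustrative only)."""

    def __init__(self, send, max_batch=3, max_wait_s=1.0, clock=time.monotonic):
        self.send = send              # callable that ships a list of lines
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.buffer = []
        self.first_at = None          # arrival time of the oldest buffered line

    def on_line(self, line: str) -> None:
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(line)
        too_big = len(self.buffer) >= self.max_batch
        too_old = self.clock() - self.first_at >= self.max_wait_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []          # start a fresh batch


sent = []
agent = BatchingAgent(sent.append, max_batch=3)
for i in range(4):
    agent.on_line(f"line {i}")
agent.flush()                          # ship whatever is left
```

Because the application only writes to stdout or a file, a slow or failing `send` affects the agent's buffer, not the request path.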

6️⃣ Ingestion Pipeline

Basic Flow

Application
→ Log Agent
→ Ingestion Gateway
→ Message Queue / Log Bus
→ Processing Workers
→ Storage
→ Index

Ingestion Gateway Responsibilities

Authentication
Rate limiting
Schema validation
Compression handling
Tenant/service tagging
Load shedding when overloaded

Processing Workers

Responsibilities:

Parse logs
Normalize fields
Enrich metadata
Redact sensitive data
Route to storage/index
Generate alerts if needed

👉 Interview Answer

I would design the ingestion path as an asynchronous pipeline.

Log agents send logs to ingestion gateways, which validate, authenticate, and forward logs to a durable queue.

Processing workers then parse, normalize, enrich, redact, and route logs to storage and indexing systems.
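
The parse, normalize, and enrich steps of a processing worker might look roughly like the sketch below. The `process` function and its `host` / `region` parameters are hypothetical; redaction and routing are left out to keep the example short.

```python
import json


def process(raw_line: str, host: str, region: str) -> dict:
    """Parse one raw line, normalize fields, and enrich with agent metadata."""
    try:
        event = json.loads(raw_line)              # structured log
        if not isinstance(event, dict):
            raise ValueError("not a JSON object")
    except ValueError:                            # includes json.JSONDecodeError
        event = {"message": raw_line}             # unstructured fallback
    event["level"] = str(event.get("level", "INFO")).upper()   # normalize
    event.setdefault("host", host)                # enrich, without overwriting
    event.setdefault("region", region)
    return event


structured = process('{"level": "error", "message": "payment failed"}',
                     host="host-1", region="us-east-1")
unstructured = process("plain text line", host="host-1", region="us-east-1")
```

Falling back to a message-only event keeps unstructured logs flowing through the same pipeline instead of rejecting them.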

7️⃣ Buffering and Backpressure

Why Needed?

Log volume can spike during incidents.

Ironically, when systems fail, they often generate more logs.

Buffer Layers

Agent local disk buffer
Ingestion gateway memory buffer
Durable queue / log bus
Storage write buffer

Backpressure Strategies

Apply rate limiting per service / tenant
Drop debug logs first
Sample high-volume logs
Prioritize ERROR logs
Slow down non-critical producers
Use local disk buffering in agents

👉 Interview Answer

Backpressure is essential because log volume often spikes during incidents.

I would use multiple buffering layers: local agent buffers, a durable queue, and storage write buffers.

If the system is overloaded, I would prioritize critical logs like ERROR, sample or drop DEBUG logs, and apply per-service rate limits.
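
One way to express "drop DEBUG first, sample INFO, keep ERROR" is a keep-probability table keyed by load state. The sketch below is an assumption-heavy illustration: the load states, the queue-lag thresholds (30 s and 300 s), and the probabilities are made-up values, not recommendations.

```python
import random

# Hypothetical keep-probabilities per level at each load state.
KEEP_PROBABILITY = {
    "normal":     {"ERROR": 1.0, "WARN": 1.0, "INFO": 1.0, "DEBUG": 1.0},
    "elevated":   {"ERROR": 1.0, "WARN": 1.0, "INFO": 0.5, "DEBUG": 0.0},
    "overloaded": {"ERROR": 1.0, "WARN": 0.5, "INFO": 0.1, "DEBUG": 0.0},
}


def load_state(queue_lag_s: float) -> str:
    """Map queue lag to a coarse load state (thresholds are assumptions)."""
    if queue_lag_s < 30:
        return "normal"
    if queue_lag_s < 300:
        return "elevated"
    return "overloaded"


def admit(level: str, queue_lag_s: float, rng=random.random) -> bool:
    """Decide whether to accept one log event under the current load."""
    p = KEEP_PROBABILITY[load_state(queue_lag_s)].get(level, 1.0)
    return rng() < p                  # rng() is in [0, 1), so p=1.0 always keeps


keeps_error = admit("ERROR", queue_lag_s=1000)
keeps_debug = admit("DEBUG", queue_lag_s=60)
```

Every rejected event should still increment a dropped-log counter so that shedding is visible on dashboards.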

8️⃣ Storage Design

Hot Storage

Used for recent logs.

Requirements:

Fast search
High write throughput
Time-range queries

Examples:

Elasticsearch / OpenSearch / ClickHouse

Cold Storage

Used for long-term retention.

Requirements:

Low cost
Durable
Slower query acceptable

Examples:

S3 / object storage

Recommended Storage Strategy

Recent logs → hot searchable storage
Older logs → cold object storage

Partitioning

Partition by:

date / hour
tenant
service
environment

👉 Interview Answer

I would separate hot and cold storage.

Recent logs go to hot storage for fast search, while older logs are moved to cheaper object storage for long-term retention.

Logs are naturally time-series data, so partitioning by time, tenant, and service is usually effective.
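
To make the partitioning idea concrete, here is a small sketch that derives an hourly partition path and picks a storage tier by age. The path layout, the 14-day `HOT_WINDOW`, and both function names are assumptions for illustration; timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=14)       # assumed hot-retention window


def partition_key(tenant: str, service: str, ts: datetime) -> str:
    """Build an hourly partition path such as tenant/service/2026/05/02/10."""
    ts = ts.astimezone(timezone.utc)  # ts must be timezone-aware
    return f"{tenant}/{service}/{ts:%Y/%m/%d/%H}"


def storage_tier(ts: datetime, now: datetime) -> str:
    """Recent logs go to hot storage, older logs to cold storage."""
    return "hot" if now - ts <= HOT_WINDOW else "cold"


ts = datetime(2026, 5, 2, 10, 0, tzinfo=timezone.utc)
key = partition_key("acme", "payment-service", ts)
tier = storage_tier(ts, now=ts + timedelta(days=30))
```

Putting time in the key means a time-bounded query touches only a few partitions, and expiring old data becomes a matter of deleting whole partitions.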

9️⃣ Indexing Strategy

What to Index?

Index common fields:

timestamp
service
level
environment
trace ID
request ID
host
region

Avoid indexing every field.

Why Not Index Everything?

Expensive storage
Slower writes
Higher memory usage
Explosive cardinality

High-cardinality Fields

Examples:

user ID
request ID
session ID
order ID

These are useful for point lookups but expensive to index at scale, so index them selectively rather than by default.

👉 Interview Answer

I would not index every field by default.

Indexing improves query performance, but increases write cost, storage cost, and memory usage.

I would index common fields like timestamp, service, level, and trace ID, while controlling high-cardinality fields carefully.

🔟 Query and Search Flow

Search Flow

User submits query
→ Query service validates permission
→ Query planner selects shards/partitions
→ Search hot storage
→ Optionally query cold storage
→ Merge results
→ Return logs

Common Query Patterns

service = payment-service AND level = ERROR
traceId = trace-123
timestamp between T1 and T2
message contains "timeout"

Query Optimization

Time-range filtering first
Service/environment filters
Index-based filtering
Limit result size
Cursor-based pagination
Async query for large searches

👉 Interview Answer

Log queries are usually time-bound.

I would first narrow the query by time range, then by service, environment, and indexed fields.

For very large queries, I would run them asynchronously and limit result size to protect the system.
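
The "narrow by time range first" step can be sketched as a planner that lists only the hourly partitions overlapping the query window. The partition naming (`YYYY-MM-DD-HH`) and the `hourly_partitions` helper are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone


def hourly_partitions(start: datetime, end: datetime) -> list[str]:
    """List the hourly partitions that overlap the half-open range [start, end)."""
    cur = start.replace(minute=0, second=0, microsecond=0)
    out = []
    while cur < end:
        out.append(cur.strftime("%Y-%m-%d-%H"))
        cur += timedelta(hours=1)
    return out


start = datetime(2026, 5, 2, 10, 30, tzinfo=timezone.utc)
end = datetime(2026, 5, 2, 12, 15, tzinfo=timezone.utc)
partitions = hourly_partitions(start, end)
```

A query service could reject or push to the async path any request whose partition list exceeds some configured size, which is one simple way to protect the cluster from unbounded scans.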

1️⃣1️⃣ Real-time Log Streaming

Use Cases

Tail logs during deployment
Debug production incident
Watch one service in real time

Flow

Log ingestion
→ Stream processor
→ Pub/Sub topic
→ WebSocket / SSE
→ User UI

Challenges

High fanout
User permissions
Backpressure
Streaming too much data

👉 Interview Answer

Real-time log streaming can be implemented using pub/sub and WebSocket or server-sent events.

The system should enforce permissions, limit stream volume, and apply backpressure to avoid overwhelming the UI or backend.
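
For limiting stream volume, one simple option is a bounded per-subscriber buffer that drops the oldest lines when the client cannot keep up. The `Subscriber` class below is a hypothetical sketch; permission checks and the WebSocket / SSE transport are out of scope here.

```python
from collections import deque


class Subscriber:
    """Per-client buffer that drops the oldest lines when the client is slow."""

    def __init__(self, service: str, max_buffered: int = 1000):
        self.service = service
        self.queue = deque(maxlen=max_buffered)
        self.dropped = 0

    def offer(self, event: dict) -> None:
        if event.get("service") != self.service:
            return                      # server-side filter
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1           # the deque evicts its oldest entry
        self.queue.append(event)

    def drain(self) -> list[dict]:
        """Return buffered events for the next push to the client."""
        items = list(self.queue)
        self.queue.clear()
        return items


sub = Subscriber("payment-service", max_buffered=2)
for n in range(3):
    sub.offer({"service": "payment-service", "n": n})
sub.offer({"service": "other-service", "n": 99})
batch = sub.drain()
```

Reporting `dropped` to the UI lets the user know the tail is lossy instead of silently hiding gaps.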

1️⃣2️⃣ Retention, Sampling, and Cost Control

Retention Policy

Example:

ERROR logs: 90 days hot, 1 year cold
INFO logs: 14 days hot, 90 days cold
DEBUG logs: 1 day hot, 7 days cold

Sampling

Useful for high-volume logs.

Examples:

Keep 100% ERROR logs
Keep 10% INFO logs
Keep 1% DEBUG logs

Cost Control Strategies

Different retention by log level
Sampling low-value logs
Compress logs
Move old logs to cold storage
Limit high-cardinality indexing
Drop noisy logs at ingestion

👉 Interview Answer

Logging systems can become very expensive, so retention and sampling are important.

I would keep ERROR logs longer, sample INFO and DEBUG logs more aggressively, compress stored logs, and move older data to cold object storage.
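
Level-based sampling can be made deterministic by hashing a stable key, so that repeated decisions for the same trace agree. The sketch below uses the example rates above; hashing on trace ID and the `keep` helper are assumptions of this illustration.

```python
import hashlib

SAMPLE_RATE = {"ERROR": 1.0, "INFO": 0.10, "DEBUG": 0.01}   # example rates from above


def keep(level: str, trace_id: str) -> bool:
    """Deterministically keep a fixed fraction of traces per level."""
    rate = SAMPLE_RATE.get(level, 1.0)           # unknown levels are kept
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1)
    return bucket < rate


kept_info = sum(keep("INFO", f"trace-{i}") for i in range(10_000))
```

Because the bucket depends only on the trace ID, any trace whose DEBUG lines are kept also has its INFO lines kept, which makes sampled traces easier to read end to end.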

1️⃣3️⃣ Alerting on Logs

Use Cases

Error rate spike
Specific exception pattern
Security event detection
Failed payment increase

Flow

Log stream
→ Rule engine
→ Alert condition matched
→ Notification system
→ Pager / Slack / Email

Design Notes

Alert rules should be evaluated on streaming logs
Avoid alerting on a single raw log line when possible
Use aggregation windows
Deduplicate alerts
Add severity and routing rules

👉 Interview Answer

Log-based alerting should be built on top of the log stream.

A rule engine can evaluate patterns or aggregations, such as error count per service over five minutes.

To avoid noisy alerts, I would use aggregation windows, deduplication, severity levels, and routing rules.
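
An aggregation window with deduplication could be sketched as follows. The `ErrorRateRule` class, its thresholds, and the cooldown-based deduplication are illustrative assumptions; timestamps are plain seconds and are assumed to arrive in non-decreasing order per service.

```python
from collections import defaultdict, deque


class ErrorRateRule:
    """Fire when a service logs >= threshold ERRORs within window_s seconds."""

    def __init__(self, threshold=100, window_s=300, cooldown_s=600):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.errors = defaultdict(deque)   # service -> timestamps of recent errors
        self.last_fired = {}               # service -> time of the last alert

    def observe(self, service: str, level: str, ts: float) -> bool:
        """Record one log event; return True if an alert should be sent."""
        if level != "ERROR":
            return False
        window = self.errors[service]
        window.append(ts)
        while window and window[0] <= ts - self.window_s:
            window.popleft()               # expire events outside the window
        if len(window) < self.threshold:
            return False
        last = self.last_fired.get(service)
        if last is not None and ts - last < self.cooldown_s:
            return False                   # deduplicate repeated alerts
        self.last_fired[service] = ts
        return True


rule = ErrorRateRule(threshold=3, window_s=60, cooldown_s=120)
fired = [rule.observe("payment-service", "ERROR", t) for t in (0, 10, 20, 30)]
```

In this run the third error crosses the threshold and fires once; the fourth is suppressed by the cooldown even though the window still exceeds the threshold.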

1️⃣4️⃣ Security and Privacy

Risks

Logs may contain sensitive data
Developers may log tokens or passwords
Cross-tenant data exposure
Unauthorized access to production logs

Protections

Redact sensitive fields
Encrypt logs at rest and in transit
Role-based access control
Tenant isolation
Audit log access
Data deletion support
PII detection pipeline

👉 Interview Answer

Logs often contain sensitive data, so security and privacy are critical.

I would redact secrets and PII during processing, encrypt logs at rest and in transit, enforce role-based access control, and audit access to production logs.
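
A minimal redaction pass, assuming regex-based rules applied to the message text. The three patterns below are illustrative only and will miss many real cases; production pipelines typically combine field-level allowlists with a tested PII detection rule set.

```python
import re

# Illustrative patterns only; real deployments need a broader, tested rule set.
PATTERNS = [
    (re.compile(r"\b(password|token|secret)=\S+", re.IGNORECASE), r"\1=[REDACTED]"),
    (re.compile(r"\b\d{13,16}\b"), "[REDACTED_CARD]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def redact(message: str) -> str:
    """Apply each pattern in order and return the scrubbed message."""
    for pattern, replacement in PATTERNS:
        message = pattern.sub(replacement, message)
    return message


clean = redact("login password=hunter2 user=a@b.com")
```

Running redaction in the processing workers, before anything is written to raw storage or the index, avoids having to scrub secrets out of durable copies later.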

1️⃣5️⃣ Failure Handling

Common Failures

Agent cannot reach ingestion gateway
Queue backlog
Storage write failure
Indexing delay
Hot shard overload
Query timeout
Malformed log event
Log storm during incidents

Strategies

Local disk buffer in agent
Durable queue
Retry with backoff
Dead-letter queue for bad logs
Degrade search but keep ingestion
Drop low-priority logs under pressure
Re-index from raw storage

👉 Interview Answer

The logging system should prioritize ingestion durability.

If indexing is delayed, logs should still be stored in raw durable storage.

If the system is overloaded, it can drop or sample low-priority logs, but should preserve critical ERROR and security logs.

Since raw logs are stored durably, indexes can be rebuilt later.
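
Retry with backoff and a dead-letter queue might be combined as in the sketch below. It assumes a hypothetical `write` callable that raises `ConnectionError` for transient storage failures and a made-up `MalformedEvent` for events that can never succeed; jitter is omitted for brevity.

```python
import time


class MalformedEvent(Exception):
    """Raised by the (hypothetical) writer for events that can never succeed."""


def handle(event, write, dead_letters, max_attempts=4,
           base_delay_s=0.1, sleep=time.sleep) -> str:
    """Write one event, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            write(event)
            return "ok"
        except MalformedEvent:
            dead_letters.append(event)      # retrying would not help
            return "dead-lettered"
        except ConnectionError:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)   # 0.1s, 0.2s, 0.4s, ...
    return "retry-later"                    # leave the event in the durable queue


calls = {"n": 0}


def flaky_write(event):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("storage unavailable")


delays = []
result = handle({"message": "payment failed"}, flaky_write, [], sleep=delays.append)
```

Returning "retry-later" instead of dropping the event relies on the durable queue: the worker simply does not acknowledge the message, so it is redelivered once storage recovers.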

1️⃣6️⃣ Consistency Model

Stronger Consistency Needed For

Security audit logs
Compliance logs
Raw log durability
Access control decisions

Eventual Consistency Acceptable For

Search index updates
Alerting delay
Dashboard metrics
Cold storage availability
Aggregated statistics

👉 Interview Answer

Logging systems usually accept eventual consistency for search.

It is acceptable if a log line becomes searchable after a short delay.

However, raw log durability and security audit logs need stronger guarantees, because they may be used for compliance or incident investigation.

1️⃣7️⃣ Observability of the Logging System

Key Metrics

Ingestion QPS
Ingestion latency
Queue lag
Dropped log count
Indexed log count
Indexing lag
Storage write error rate
Query latency
Query error rate
Hot shard count
Cost per service / tenant

Important Dashboards

Ingestion health
Queue backlog
Indexing freshness
Search performance
Storage growth
Dropped / sampled logs
Tenant usage and cost

👉 Interview Answer

The logging system itself must be observable.

I would monitor ingestion QPS, queue lag, dropped logs, indexing delay, storage errors, and query latency.

This helps detect whether we are losing logs, falling behind, or spending too much on storage.

1️⃣8️⃣ End-to-End Flow

Log Ingestion Flow

Application writes log
→ Local agent reads stdout/file
→ Agent buffers and batches logs
→ Ingestion gateway validates request
→ Queue stores logs durably
→ Workers parse and enrich logs
→ Raw logs stored
→ Indexed fields sent to search index

Search Flow

User submits query
→ Permission check
→ Query planner selects time partitions
→ Search hot index
→ Fetch raw log content
→ Merge and paginate results
→ Return response

Alert Flow

Log stream
→ Rule engine
→ Aggregation window
→ Alert condition matched
→ Notification system
→ On-call user notified

Key Insight

A logging system is a high-throughput data pipeline, not just a place to store text.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a logging system, I think of it as a high-throughput data pipeline for collecting, storing, indexing, searching, and alerting on logs.

Applications should not synchronously send logs to the central logging service. Instead, they write logs to stdout or local files, and local agents collect, buffer, enrich, and forward logs asynchronously.

The ingestion path should use gateways, durable queues, processing workers, raw storage, and search indexes.

Recent logs should go to hot storage for fast search, while older logs should be moved to cheaper cold storage.

I would index common fields like timestamp, service, environment, level, trace ID, and host, but avoid indexing every field because high-cardinality indexing can be very expensive.

For query serving, I would first narrow by time range, then use indexed fields like service and level, and support pagination or async queries for large searches.

Backpressure is critical because log volume often spikes during incidents. I would use local buffering, durable queues, rate limiting, sampling, and priority handling to preserve critical logs while controlling cost.

Security is also important: logs should be encrypted, sensitive data should be redacted, and access should be controlled and audited.

The main trade-offs are ingestion durability, search freshness, query latency, storage cost, and indexing complexity.

Ultimately, the goal is to reliably collect massive log volume, make recent logs searchable quickly, retain older logs cost-effectively, and support debugging, monitoring, and incident response.

⭐ Final Insight

The core of a logging system is not storing text; it is building a high-throughput, searchable, cost-controlled, and auditable data pipeline.
