🎯 Design Logging System
1️⃣ Core Framework
When discussing Logging System design, I frame it as:
- Log generation and collection
- Log ingestion pipeline
- Buffering, queueing, and backpressure
- Log storage and indexing
- Search and query serving
- Retention, sampling, and cost control
- Reliability, ordering, and failure handling
- Security, access control, and observability
2️⃣ Core Requirements
Functional Requirements
- Applications can emit logs
- Support structured and unstructured logs
- Support log ingestion from many services
- Support log search by time range, service, level, trace ID
- Support real-time log streaming
- Support retention policies
- Support alerts based on log patterns
- Support access control
Non-functional Requirements
- High write throughput
- Low ingestion latency
- Durable log storage
- Scalable query performance
- High availability
- Cost-efficient long-term retention
- Eventually consistent search is acceptable
👉 Interview Answer
A logging system collects logs from many applications, ingests them through a scalable pipeline, stores them durably, indexes them for search, and supports querying, alerting, and retention.
The main challenge is handling very high write throughput while keeping search fast and storage cost under control.
3️⃣ Main APIs / Interfaces
Emit Log
Applications usually do not send logs straight to the central service.
They write them locally first, and an agent picks them up:
Application → stdout / log file → agent
Log Ingestion API
POST /api/logs
Request:
{
"service": "payment-service",
"environment": "prod",
"level": "ERROR",
"message": "payment failed",
"timestamp": "2026-05-02T10:00:00Z",
"traceId": "trace-123",
"metadata": {
"orderId": "o789",
"region": "us-east-1"
}
}
Search Logs
GET /api/logs/search?service=payment-service&level=ERROR&from=...&to=...
Stream Logs
GET /api/logs/stream?service=payment-service
👉 Interview Answer
Applications usually write logs to stdout or local files.
A log agent collects those logs and sends them to the central ingestion pipeline.
The system should also expose search and streaming APIs for debugging, monitoring, and incident response.
4️⃣ Data Model
Log Event
{
"logId": "log-123",
"service": "payment-service",
"environment": "prod",
"level": "ERROR",
"timestamp": "2026-05-02T10:00:00Z",
"message": "payment failed",
"traceId": "trace-123",
"host": "host-1",
"region": "us-east-1",
"metadata": {
"orderId": "o789"
}
}
Common Indexed Fields
- timestamp
- service
- environment
- log level
- trace ID
- host
- region
- request ID
- user-defined tags
Raw Log vs Indexed Log
- Raw log = complete original event
- Indexed log = selected fields optimized for search
👉 Interview Answer
I would store logs as structured events whenever possible.
Common fields like timestamp, service, level, environment, trace ID, and host should be indexed.
The raw log should still be preserved, while selected fields are indexed for fast querying.
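The raw-versus-indexed split can be sketched as below. This is a minimal illustration, assuming the field names from the Log Event example; `INDEXED_FIELDS` and `split_event` are hypothetical names, not part of any particular product.

```python
import json

# Fields promoted to the search index (assumed set, taken from the list above).
INDEXED_FIELDS = ("timestamp", "service", "environment", "level", "traceId", "host", "region")

def split_event(event: dict) -> tuple[str, dict]:
    """Return (raw_json, index_doc) for one log event."""
    raw = json.dumps(event, sort_keys=True)  # complete original event, kept as-is
    index_doc = {k: event[k] for k in INDEXED_FIELDS if k in event}
    index_doc["logId"] = event["logId"]      # pointer back to the raw record
    return raw, index_doc

event = {
    "logId": "log-123", "service": "payment-service", "environment": "prod",
    "level": "ERROR", "timestamp": "2026-05-02T10:00:00Z",
    "message": "payment failed", "traceId": "trace-123",
    "host": "host-1", "region": "us-east-1", "metadata": {"orderId": "o789"},
}
raw, index_doc = split_event(event)
```

The index document carries only the searchable fields plus `logId`, so a search hit can be resolved back to the full raw event.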
5️⃣ Log Collection
Option 1: Application Pushes Logs Directly
Application → Logging Service
Pros
- Simple conceptually
- Direct control from application
Cons
- Logging failure can affect application
- Adds latency to request path
- Harder to buffer locally
Option 2: Local Agent Collection
Application → stdout / file → Log Agent → Pipeline
Pros
- Decouples app from logging system
- Supports local buffering
- Standard production pattern
- Works well with containers and Kubernetes
Cons
- Agent needs deployment and monitoring
- Agent configuration complexity
Recommended
Use local log agents.
Examples:
Fluent Bit / Fluentd / Vector / Filebeat
👉 Interview Answer
I would avoid having applications synchronously send logs to the central logging service.
Instead, applications should write logs to stdout or local files, and a local agent collects, buffers, enriches, and forwards them asynchronously.
This prevents logging failures from impacting application latency.
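On the application side, this pattern can be as small as a JSON formatter on stdout. The sketch below uses Python's standard `logging` module; the `JsonFormatter` and `build_logger` names and the choice of output fields are assumptions for illustration.

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so an agent can tail it."""
    converter = time.gmtime  # format timestamps in UTC to match the trailing "Z"

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", None),  # set via `extra=`
        })

def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(service)
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

log = build_logger("payment-service")
log.error("payment failed", extra={"traceId": "trace-123"})
```

The application only performs a local write; shipping, batching, and retries are left to the agent.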
6️⃣ Ingestion Pipeline
Basic Flow
Application
→ Log Agent
→ Ingestion Gateway
→ Message Queue / Log Bus
→ Processing Workers
→ Storage
→ Index
Ingestion Gateway Responsibilities
- Authentication
- Rate limiting
- Schema validation
- Compression handling
- Tenant/service tagging
- Load shedding when overloaded
Processing Workers
Responsibilities:
- Parse logs
- Normalize fields
- Enrich metadata
- Redact sensitive data
- Route to storage/index
- Generate alerts if needed
👉 Interview Answer
I would design the ingestion path as an asynchronous pipeline.
Log agents send logs to ingestion gateways, which validate, authenticate, and forward logs to a durable queue.
Processing workers then parse, normalize, enrich, redact, and route logs to storage and indexing systems.
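A processing worker's parse, normalize, enrich, and route steps might look roughly like this. It is a sketch under assumptions: the level aliases and the sink names (`raw_storage`, `search_index`, `alert_stream`) are hypothetical, and redaction is left out to keep it short.

```python
import json

LEVEL_ALIASES = {"WARN": "WARNING", "ERR": "ERROR", "FATAL": "CRITICAL"}  # assumed aliases

def process(line: str, host: str, region: str) -> dict:
    """Parse one raw line, normalize the level, and enrich with host metadata."""
    try:
        event = json.loads(line)
        if not isinstance(event, dict):
            raise ValueError("not a JSON object")
    except ValueError:
        # Unstructured line: keep the text instead of dropping it.
        event = {"message": line.strip(), "level": "INFO"}
    level = str(event.get("level", "INFO")).upper()
    event["level"] = LEVEL_ALIASES.get(level, level)
    event.setdefault("host", host)
    event.setdefault("region", region)
    return event

def route(event: dict) -> list[str]:
    """Every event goes to raw storage and the index; errors also feed alerting."""
    sinks = ["raw_storage", "search_index"]
    if event["level"] in ("ERROR", "CRITICAL"):
        sinks.append("alert_stream")
    return sinks
```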
7️⃣ Buffering and Backpressure
Why Needed?
Log volume can spike during incidents.
Ironically, when systems fail, they often generate more logs.
Buffer Layers
- Agent local disk buffer
- Ingestion gateway memory buffer
- Durable queue / log bus
- Storage write buffer
Backpressure Strategies
- Apply rate limiting per service / tenant
- Drop debug logs first
- Sample high-volume logs
- Prioritize ERROR logs
- Slow down non-critical producers
- Use local disk buffering in agents
👉 Interview Answer
Backpressure is essential because log volume often spikes during incidents.
I would use multiple buffering layers: local agent buffers, a durable queue, and storage write buffers.
If the system is overloaded, I would prioritize critical logs like ERROR, sample or drop DEBUG logs, and apply per-service rate limits.
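One way to express "drop DEBUG first, keep ERROR" is a bounded buffer that evicts its lowest-priority entry when full. The sketch below is illustrative only; `PriorityBuffer` and the priority table are assumptions, and a real agent would also spill to disk.

```python
import heapq
import itertools

PRIORITY = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3}  # higher = keep longer

class PriorityBuffer:
    """Bounded buffer that sheds the lowest-priority event when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []                 # min-heap on (priority, arrival order)
        self._seq = itertools.count()   # tie-breaker so event dicts are never compared
        self.dropped = 0

    def offer(self, event: dict) -> bool:
        """Return True if the event was buffered, False if it was shed."""
        entry = (PRIORITY.get(event["level"], 1), next(self._seq), event)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        self.dropped += 1  # full: either the incoming or a buffered event is lost
        if entry[0] > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the lowest-priority, oldest event
            return True
        return False

    def levels(self) -> list[str]:
        return sorted(e["level"] for _, _, e in self._heap)
```

The `dropped` counter matters as much as the eviction itself, because silently lost logs are hard to notice later.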
8️⃣ Storage Design
Hot Storage
Used for recent logs.
Requirements:
- Fast search
- High write throughput
- Time-range queries
Examples:
Elasticsearch / OpenSearch / ClickHouse
Cold Storage
Used for long-term retention.
Requirements:
- Low cost
- Durable
- Slower queries are acceptable
Examples:
S3 / object storage
Recommended Storage Strategy
Recent logs → hot searchable storage
Older logs → cold object storage
Partitioning
Partition by:
- date / hour
- tenant
- service
- environment
👉 Interview Answer
I would separate hot and cold storage.
Recent logs go to hot storage for fast search, while older logs are moved to cheaper object storage for long-term retention.
Logs are naturally time-series data, so partitioning by time, tenant, and service is usually effective.
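A partition key built from these dimensions could be derived as follows. This is a sketch: the `tenant` field and the path layout are assumptions, since the Log Event example does not define either.

```python
from datetime import datetime, timezone

def partition_key(event: dict) -> str:
    """Build an hourly partition path: tenant/service/environment/date/hour."""
    ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    ts = ts.astimezone(timezone.utc)  # partition on UTC so hours line up across regions
    return "/".join([
        event.get("tenant", "default"),  # 'tenant' is an assumed field
        event["service"],
        event["environment"],
        ts.strftime("%Y-%m-%d"),
        ts.strftime("%H"),
    ])
```

With this layout, a time-bounded query for one service only has to touch the matching hour prefixes.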
9️⃣ Indexing Strategy
What to Index?
Index common fields:
- timestamp
- service
- level
- environment
- trace ID
- request ID
- host
- region
Avoid indexing every field.
Why Not Index Everything?
- Expensive storage
- Slower writes
- Higher memory usage
- Cardinality explosion
High-cardinality Fields
Examples:
- user ID
- request ID
- session ID
- order ID
These can be useful but are expensive to index, so add them only when lookups on that field justify the cost.
👉 Interview Answer
I would not index every field by default.
Indexing improves query performance, but increases write cost, storage cost, and memory usage.
I would index common fields like timestamp, service, level, and trace ID, while controlling high-cardinality fields carefully.
🔟 Query and Search Flow
Search Flow
User submits query
→ Query service validates permission
→ Query planner selects shards/partitions
→ Search hot storage
→ Optionally query cold storage
→ Merge results
→ Return logs
Common Query Patterns
service = payment-service AND level = ERROR
traceId = trace-123
timestamp between T1 and T2
message contains "timeout"
Query Optimization
- Time-range filtering first
- Service/environment filters
- Index-based filtering
- Limit result size
- Cursor-based pagination
- Async query for large searches
👉 Interview Answer
Log queries are usually time-bound.
I would first narrow the query by time range, then by service, environment, and indexed fields.
For very large queries, I would run them asynchronously and limit result size to protect the system.
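Time-range pruning and cursor-based pagination can be sketched together. The example assumes hourly partitions named `YYYY-MM-DD/HH` and a cursor made of `(timestamp, logId)`; both are illustrative choices, not a fixed format.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def hourly_partitions(start: datetime, end: datetime) -> list[str]:
    """List the hourly partitions that overlap the half-open range [start, end)."""
    cur = start.replace(minute=0, second=0, microsecond=0)
    out = []
    while cur < end:
        out.append(cur.strftime("%Y-%m-%d/%H"))
        cur += timedelta(hours=1)
    return out

def page(rows: list[dict], cursor: Optional[tuple], limit: int):
    """Return up to `limit` rows after `cursor`, plus the cursor for the next page."""
    ordered = sorted(rows, key=lambda r: (r["timestamp"], r["logId"]))
    if cursor is not None:
        ordered = [r for r in ordered if (r["timestamp"], r["logId"]) > cursor]
    chunk = ordered[:limit]
    more = len(ordered) > limit
    next_cursor = (chunk[-1]["timestamp"], chunk[-1]["logId"]) if more else None
    return chunk, next_cursor
```

Using `(timestamp, logId)` as the cursor keeps pagination stable even when many events share the same timestamp.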
1️⃣1️⃣ Real-time Log Streaming
Use Cases
- Tail logs during deployment
- Debug production incident
- Watch one service in real time
Flow
Log ingestion
→ Stream processor
→ Pub/Sub topic
→ WebSocket / SSE
→ User UI
Challenges
- High fanout
- User permissions
- Backpressure
- Streaming too much data
👉 Interview Answer
Real-time log streaming can be implemented using pub/sub and WebSocket or server-sent events.
The system should enforce permissions, limit stream volume, and apply backpressure to avoid overwhelming the UI or backend.
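The permission check and per-subscriber backpressure can be illustrated with a bounded queue that discards the oldest entries. This is a sketch; `Subscriber` and `fan_out` are hypothetical names, and the transport (WebSocket or SSE) is left out.

```python
from collections import deque

class Subscriber:
    """One streaming client with a permission filter and a bounded backlog."""

    def __init__(self, allowed_services: set[str], max_pending: int = 1000):
        self.allowed_services = allowed_services
        self.pending = deque(maxlen=max_pending)  # oldest entries fall off when full
        self.dropped = 0

    def deliver(self, event: dict) -> bool:
        if event["service"] not in self.allowed_services:
            return False  # the user may not see this service's logs
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1  # slow consumer: count what the deque will discard
        self.pending.append(event)
        return True

def fan_out(event: dict, subscribers: list[Subscriber]) -> int:
    """Offer the event to every subscriber; return how many accepted it."""
    return sum(s.deliver(event) for s in subscribers)
```

Dropping the oldest lines suits a live tail, where a slow viewer usually cares more about what is happening now than about catching up.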
1️⃣2️⃣ Retention, Sampling, and Cost Control
Retention Policy
Example:
ERROR logs: 90 days hot, 1 year cold
INFO logs: 14 days hot, 90 days cold
DEBUG logs: 1 day hot, 7 days cold
Sampling
Useful for high-volume logs.
Examples:
Keep 100% ERROR logs
Keep 10% INFO logs
Keep 1% DEBUG logs
Cost Control Strategies
- Different retention by log level
- Sampling low-value logs
- Compress logs
- Move old logs to cold storage
- Limit high-cardinality indexing
- Drop noisy logs at ingestion
👉 Interview Answer
Logging systems can become very expensive, so retention and sampling are important.
I would keep ERROR logs longer, sample INFO and DEBUG logs more aggressively, compress stored logs, and move older data to cold object storage.
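The sampling rates above can be applied deterministically by hashing a stable key. The sketch below hashes `traceId` so that all lines from one trace are kept or dropped together; that keying choice and the `keep` function are assumptions, not something the notes prescribe.

```python
import hashlib

SAMPLE_RATES = {"ERROR": 1.0, "INFO": 0.10, "DEBUG": 0.01}  # rates from the example above

def keep(event: dict) -> bool:
    """Decide whether to keep an event, using a stable hash instead of random()."""
    rate = SAMPLE_RATES.get(event["level"], 1.0)  # unknown levels are kept
    if rate >= 1.0:
        return True
    key = event.get("traceId") or event.get("logId", "")
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate
```

Because the decision depends only on the key, every worker reaches the same verdict for the same trace without coordination.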
1️⃣3️⃣ Alerting on Logs
Use Cases
- Error rate spike
- Specific exception pattern
- Security event detection
- Failed payment increase
Flow
Log stream
→ Rule engine
→ Alert condition matched
→ Notification system
→ Pager / Slack / Email
Design Notes
- Alert rules should be evaluated on streaming logs
- Avoid alerting on a single raw log line when possible
- Use aggregation windows
- Deduplicate alerts
- Add severity and routing rules
👉 Interview Answer
Log-based alerting should be built on top of the log stream.
A rule engine can evaluate patterns or aggregations, such as error count per service over five minutes.
To avoid noisy alerts, I would use aggregation windows, deduplication, severity levels, and routing rules.
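An aggregation-window rule with deduplication might look like the sketch below. The `ErrorRateRule` class, the five-minute window, and the cooldown used for deduplication are assumptions chosen to match the example in the text.

```python
from collections import defaultdict, deque

class ErrorRateRule:
    """Fire when a service logs `threshold` errors inside a sliding window."""

    def __init__(self, threshold: int, window_s: int = 300, cooldown_s: int = 600):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._errors = defaultdict(deque)  # service -> timestamps of recent errors
        self._last_fired = {}              # service -> time of the last alert

    def observe(self, service: str, level: str, ts: float) -> bool:
        """Record one log event; return True if an alert should be sent now."""
        if level != "ERROR":
            return False
        q = self._errors[service]
        q.append(ts)
        while q and q[0] <= ts - self.window_s:
            q.popleft()  # forget errors that fell out of the window
        if len(q) < self.threshold:
            return False
        last = self._last_fired.get(service)
        if last is not None and ts - last < self.cooldown_s:
            return False  # deduplicate: an alert for this service is already active
        self._last_fired[service] = ts
        return True
```

Severity and routing would sit after this step, mapping a fired rule to a pager, chat channel, or email.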
1️⃣4️⃣ Security and Privacy
Risks
- Logs may contain sensitive data
- Developers may log tokens or passwords
- Cross-tenant data exposure
- Unauthorized access to production logs
Protections
- Redact sensitive fields
- Encrypt logs at rest and in transit
- Role-based access control
- Tenant isolation
- Audit log access
- Data deletion support
- PII detection pipeline
👉 Interview Answer
Logs often contain sensitive data, so security and privacy are critical.
I would redact secrets and PII during processing, encrypt logs at rest and in transit, enforce role-based access control, and audit access to production logs.
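Redaction during processing can combine key-based and pattern-based rules. The sketch below is deliberately small: the sensitive key names and the two regular expressions are assumptions, and real PII detection needs far broader coverage.

```python
import re

SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}  # assumed key names

PATTERNS = [
    # key=value or key: value pairs embedded in free text
    (re.compile(r"(?i)\b(password|token|secret|api[_-]?key)\b(\s*[=:]\s*)\S+"), r"\1\2[REDACTED]"),
    # email addresses
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
]

def redact_text(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def redact_event(event: dict) -> dict:
    """Return a copy of the event with sensitive keys and text patterns masked."""
    out = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact_event(value)  # recurse into nested metadata
        elif isinstance(value, str):
            out[key] = redact_text(value)
        else:
            out[key] = value
    return out
```

Redacting before the event reaches raw storage matters, because anything written there may later be re-indexed or archived.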
1️⃣5️⃣ Failure Handling
Common Failures
- Agent cannot reach ingestion gateway
- Queue backlog
- Storage write failure
- Indexing delay
- Hot shard overload
- Query timeout
- Malformed log event
- Log storm during incidents
Strategies
- Local disk buffer in agent
- Durable queue
- Retry with backoff
- Dead-letter queue for bad logs
- Degrade search but keep ingestion
- Drop low-priority logs under pressure
- Re-index from raw storage
👉 Interview Answer
The logging system should prioritize ingestion durability.
If indexing is delayed, logs should still be stored in raw durable storage.
If the system is overloaded, it can drop or sample low-priority logs, but should preserve critical ERROR and security logs.
Since raw logs are stored durably, indexes can be rebuilt later.
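Retry with backoff and a dead-letter queue can be sketched in one function. This is an illustration under assumptions: `MalformedEvent`, `deliver`, and the three return values are hypothetical, and `write` stands in for whatever storage client is used.

```python
import random
import time

class MalformedEvent(Exception):
    """Raised by the sink when an event can never be written (hypothetical)."""

def deliver(event, write, dead_letters, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Try to write one event; return 'ok', 'dead_letter', or 'retry_later'."""
    for attempt in range(max_attempts):
        try:
            write(event)
            return "ok"
        except MalformedEvent:
            dead_letters.append(event)  # retrying will not help; park it for inspection
            return "dead_letter"
        except Exception:
            if attempt < max_attempts - 1:
                delay = base_delay * 2 ** attempt
                sleep(delay + random.uniform(0, delay))  # exponential backoff with jitter
    return "retry_later"  # leave the event on the durable queue for a later pass
```

Transient failures never reach the dead-letter queue here, so it stays a list of events that genuinely need a human or a parser fix.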
1️⃣6️⃣ Consistency Model
Stronger Consistency Needed For
- Security audit logs
- Compliance logs
- Raw log durability
- Access control decisions
Eventual Consistency Acceptable For
- Search index updates
- Alerting delay
- Dashboard metrics
- Cold storage availability
- Aggregated statistics
👉 Interview Answer
Logging systems usually accept eventual consistency for search.
It is acceptable if a log line becomes searchable after a short delay.
However, raw log durability and security audit logs need stronger guarantees, because they may be used for compliance or incident investigation.
1️⃣7️⃣ Observability of the Logging System
Key Metrics
- Ingestion QPS
- Ingestion latency
- Queue lag
- Dropped log count
- Indexed log count
- Indexing lag
- Storage write error rate
- Query latency
- Query error rate
- Hot shard count
- Cost per service / tenant
Important Dashboards
- Ingestion health
- Queue backlog
- Indexing freshness
- Search performance
- Storage growth
- Dropped / sampled logs
- Tenant usage and cost
👉 Interview Answer
The logging system itself must be observable.
I would monitor ingestion QPS, queue lag, dropped logs, indexing delay, storage errors, and query latency.
This helps detect whether we are losing logs, falling behind, or spending too much on storage.
1️⃣8️⃣ End-to-End Flow
Log Ingestion Flow
Application writes log
→ Local agent reads stdout/file
→ Agent buffers and batches logs
→ Ingestion gateway validates request
→ Queue stores logs durably
→ Workers parse and enrich logs
→ Raw logs stored
→ Indexed fields sent to search index
Search Flow
User submits query
→ Permission check
→ Query planner selects time partitions
→ Search hot index
→ Fetch raw log content
→ Merge and paginate results
→ Return response
Alert Flow
Log stream
→ Rule engine
→ Aggregation window
→ Alert condition matched
→ Notification system
→ On-call user notified
Key Insight
A logging system is a high-throughput data pipeline, not just a place to store text.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a logging system, I think of it as a high-throughput data pipeline for collecting, storing, indexing, searching, and alerting on logs.
Applications should not synchronously send logs to the central logging service. Instead, they write logs to stdout or local files, and local agents collect, buffer, enrich, and forward logs asynchronously.
The ingestion path should use gateways, durable queues, processing workers, raw storage, and search indexes.
Recent logs should go to hot storage for fast search, while older logs should be moved to cheaper cold storage.
I would index common fields like timestamp, service, environment, level, trace ID, and host, but avoid indexing every field because high-cardinality indexing can be very expensive.
For query serving, I would first narrow by time range, then use indexed fields like service and level, and support pagination or async queries for large searches.
Backpressure is critical because log volume often spikes during incidents. I would use local buffering, durable queues, rate limiting, sampling, and priority handling to preserve critical logs while controlling cost.
Security is also important: logs should be encrypted, sensitive data should be redacted, and access should be controlled and audited.
The main trade-offs are ingestion durability, search freshness, query latency, storage cost, and indexing complexity.
Ultimately, the goal is to reliably collect massive log volume, make recent logs searchable quickly, retain older logs cost-effectively, and support debugging, monitoring, and incident response.
⭐ Final Insight
The core of a logging system is not storing text; it is building a high-throughput, searchable, cost-controlled, and auditable data pipeline.