🎯 Design Event Tracking System
1️⃣ Core Framework
When discussing Event Tracking System design, I frame it as:
- Event generation and SDK design
- Event schema and validation
- Ingestion pipeline
- Stream processing and batch processing
- Storage for raw events and analytics
- Deduplication, ordering, and sessionization
- Privacy, consent, and compliance
- Trade-offs: accuracy vs latency vs cost
2️⃣ Core Requirements
Functional Requirements
- Track user events:
  - page view
  - click
  - impression
  - purchase
  - signup
  - search
- Support web, mobile, and backend events
- Support event schema validation
- Support real-time analytics
- Support batch analytics
- Support dashboards and funnels
- Support attribution
- Support data export
Non-functional Requirements
- High write throughput
- Low ingestion latency
- Durable raw event storage
- Scalable processing pipeline
- Effectively-once results where possible (at-least-once delivery plus idempotent processing)
- Eventual consistency acceptable for analytics
- Strong privacy and access control
- Cost-efficient long-term storage
👉 Interview Answer
An event tracking system collects user and system events, validates them, stores raw events durably, processes them into analytics tables, and powers dashboards, funnels, attribution, and experimentation.
The main challenges are high write throughput, data quality, deduplication, privacy, and cost control.
3️⃣ Main APIs
Track Event
POST /api/events
Request:
{
  "eventId": "evt_123",
  "userId": "u456",
  "anonymousId": "anon_789",
  "eventName": "product_clicked",
  "timestamp": "2026-05-03T10:00:00Z",
  "properties": {
    "productId": "p123",
    "category": "shoes",
    "surface": "home"
  },
  "context": {
    "device": "mobile",
    "appVersion": "1.2.3",
    "ip": "1.2.3.4"
  }
}
Batch Track Events
POST /api/events/batch
Query Analytics
GET /api/analytics/funnel?name=checkout
Export Events
POST /api/events/export
👉 Interview Answer
The most important API is the track event API.
The API should also accept batch uploads, because mobile and web SDKs buffer events and send them in batches to reduce network overhead.
Analytics queries should be served from processed tables, not raw ingestion APIs.
4️⃣ Event Schema
Common Event Fields
{
  "eventId": "unique event id",
  "eventName": "product_clicked",
  "userId": "known user id",
  "anonymousId": "anonymous device/session id",
  "timestamp": "client event time",
  "receivedAt": "server received time",
  "properties": {},
  "context": {}
}
Why Schema Matters
- Prevent bad data
- Support analytics consistency
- Improve query reliability
- Enable governance
- Reduce downstream pipeline failures
Schema Registry
Use a schema registry to define:
- Required fields
- Allowed event names
- Property types
- Versioning
- Deprecation rules
👉 Interview Answer
Event schema is critical for data quality.
I would use a schema registry to define allowed event names, required fields, property types, and schema versions.
Bad events should be rejected, quarantined, or routed to a dead-letter queue.
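The registry check can be sketched in a few lines of Python. This is a minimal illustration, not a real schema-registry API: the in-memory `REGISTRY` dict, the `validate_event` and `route` helpers, and the choice of required top-level fields are all assumptions made for the example.

```python
from typing import Any

# Hypothetical in-memory registry: event name -> required property types.
REGISTRY: dict[str, dict[str, type]] = {
    "product_clicked": {"productId": str, "category": str, "surface": str},
}

# Top-level fields assumed to be mandatory for every event.
REQUIRED_FIELDS = ("eventId", "eventName", "timestamp")


def validate_event(event: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the event is valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in event]
    name = event.get("eventName")
    schema = REGISTRY.get(name)
    if schema is None:
        errors.append(f"unknown event name: {name}")
        return errors
    props = event.get("properties", {})
    for key, expected in schema.items():
        if key not in props:
            errors.append(f"missing property: {key}")
        elif not isinstance(props[key], expected):
            errors.append(f"wrong type for property: {key}")
    return errors


def route(event: dict[str, Any], accepted: list, dead_letter: list) -> None:
    """Send valid events onward and invalid ones to a dead-letter list."""
    errors = validate_event(event)
    if errors:
        dead_letter.append({"event": event, "errors": errors})
    else:
        accepted.append(event)
```

Keeping the violations next to the rejected event makes the dead-letter queue useful for debugging which producer is sending bad data.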
5️⃣ High-Level Architecture
Client SDK / Backend Service
→ Event Collector
→ Validation / Enrichment
→ Message Queue / Event Log
→ Stream Processing
→ Raw Event Storage
→ Analytics Warehouse
→ Dashboard / Experiment / Attribution
Main Components
Client SDK
- Generates event ID
- Buffers events
- Retries failed sends
- Handles batching
Event Collector
- Receives events
- Authenticates clients
- Rate limits traffic
- Validates payloads
Stream Processor
- Deduplicates events
- Enriches events
- Computes real-time metrics
- Writes to downstream stores
Data Warehouse
- Stores processed analytics data
- Supports dashboards and ad hoc queries
👉 Interview Answer
I would design event tracking as a high-throughput data pipeline.
SDKs collect and batch events, collectors validate and enrich events, queues provide durability, stream processors compute real-time outputs, and raw events are stored for replay and offline analytics.
6️⃣ Client SDK Design
Responsibilities
- Generate event IDs
- Attach user/session/device context
- Buffer events locally
- Batch send events
- Retry with backoff
- Respect user consent
- Drop events when storage is full
Mobile Considerations
- App may go offline
- Battery usage matters
- Network usage matters
- Events may arrive late
Web Considerations
- Page may close before events are sent
- Use beacon API when possible
- Avoid blocking page navigation
👉 Interview Answer
Client SDK design is important because events are generated at the edge.
The SDK should buffer events, batch uploads, retry failures, attach context, and respect user consent.
It should never block the user experience.
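A rough Python sketch of the buffering, batching, and retry behavior follows. The `EventBuffer` class, its parameters, and the `send` callback are hypothetical names for illustration. Dropping the oldest event when the buffer is full is one possible policy, assumed here; a real SDK would also run `flush` on a background thread or timer so it never blocks the UI.

```python
import random
import time
import uuid
from collections import deque
from datetime import datetime, timezone
from typing import Callable


class EventBuffer:
    """Bounded local buffer that batches events and retries failed sends."""

    def __init__(self, send: Callable[[list[dict]], bool],
                 max_events: int = 1000, batch_size: int = 50,
                 max_retries: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send  # hypothetical transport; returns True on success
        # Assumed policy: when the buffer is full, the oldest event is dropped.
        self._queue: deque[dict] = deque(maxlen=max_events)
        self._batch_size = batch_size
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep

    def __len__(self) -> int:
        return len(self._queue)

    def track(self, name: str, properties: dict, consent: bool = True) -> None:
        if not consent:  # respect user consent: record nothing
            return
        self._queue.append({
            "eventId": str(uuid.uuid4()),  # created once, reused on every retry
            "eventName": name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "properties": properties,
        })

    def flush(self) -> bool:
        """Send one batch; keep it buffered if every attempt fails."""
        batch = list(self._queue)[: self._batch_size]
        if not batch:
            return True
        for attempt in range(self._max_retries):
            if self._send(batch):
                for _ in batch:
                    self._queue.popleft()
                return True
            if attempt < self._max_retries - 1:
                # Exponential backoff with jitter before the next attempt.
                delay = self._base_delay * (2 ** attempt)
                self._sleep(delay * random.uniform(0.5, 1.5))
        return False
```

Because the event ID is assigned in `track` and not in `flush`, a retried batch carries the same IDs, which is what lets the server deduplicate it.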
7️⃣ Ingestion Pipeline
Basic Flow
Event received
→ Authenticate source
→ Validate schema
→ Add received timestamp
→ Enrich context
→ Write to durable event log
→ Return success
Event Collector Responsibilities
- Authentication
- Rate limiting
- Schema validation
- Payload size validation
- Timestamp normalization
- IP / geo enrichment
- User-agent parsing
- Consent filtering
Durable Queue
Common choice:
Kafka / Pulsar / Kinesis
Why?
- High throughput
- Durable buffering
- Replayability
- Backpressure handling
👉 Interview Answer
The ingestion service should do lightweight validation and enrichment, then write events to a durable log such as Kafka.
Once the event is durably written, the collector can acknowledge success.
Heavy processing should happen asynchronously.
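The collector's request path might look like the sketch below. Everything named here is an assumption for illustration: `handle_batch`, the `VALID_API_KEYS` set, the batch size limit, the HTTP status codes, and the `append_to_log` callback, which stands in for a producer call that returns only after the log has durably accepted the record.

```python
from datetime import datetime, timezone
from typing import Callable

MAX_BATCH_EVENTS = 500          # assumed payload limit
VALID_API_KEYS = {"demo-key"}   # hypothetical credential store


def handle_batch(api_key: str, events: list[dict],
                 append_to_log: Callable[[dict], None]) -> tuple[int, dict]:
    """Authenticate, validate lightly, stamp, write to the log, then acknowledge."""
    if api_key not in VALID_API_KEYS:
        return 401, {"error": "unauthorized"}
    if len(events) > MAX_BATCH_EVENTS:
        return 413, {"error": "batch too large"}

    accepted, rejected = 0, 0
    for event in events:
        # Lightweight check only; deeper validation can run asynchronously.
        if "eventId" not in event or "eventName" not in event:
            rejected += 1
            continue
        stamped = dict(event)
        stamped["receivedAt"] = datetime.now(timezone.utc).isoformat()
        append_to_log(stamped)  # assumed to return only once the write is durable
        accepted += 1
    # Success is reported only after the accepted events reached the log.
    return 202, {"accepted": accepted, "rejected": rejected}
```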
8️⃣ Deduplication and Idempotency
Why Needed?
Duplicate events can happen due to:
- Client retries
- Network timeout
- SDK resend
- Collector retry
- Stream processor retry
Deduplication Key
Use:
eventId
or:
source + eventId
Dedup Strategy
- Short-term dedupe cache
- Persistent dedupe table for critical events
- Idempotent downstream writes
- Exactly-once stream processing where supported
👉 Interview Answer
Event systems should assume duplicates will happen.
Every event should include a unique event ID.
The pipeline can use dedupe caches, idempotent writes, and event IDs to avoid double-counting important events.
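A short-term dedupe cache keyed on source + eventId can be sketched as follows. The `DedupeCache` class and its TTL are illustrative assumptions; in production the same idea is usually backed by a shared store or by stream-processor state, not process memory.

```python
import time
from collections import OrderedDict
from typing import Callable


class DedupeCache:
    """Remembers (source, eventId) keys for a limited time window."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: OrderedDict[tuple[str, str], float] = OrderedDict()

    def is_duplicate(self, source: str, event_id: str) -> bool:
        """Return True if the key was seen within the TTL; otherwise record it."""
        now = self._clock()
        # Keys are never refreshed, so insertion order equals time order and
        # expired entries always sit at the front.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self._ttl:
                break
            self._seen.popitem(last=False)
        key = (source, event_id)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False
```

A duplicate that arrives after the TTL has expired slips through, which is why critical events also need a persistent dedupe table or idempotent downstream writes.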
9️⃣ Ordering and Late Events
Ordering Problem
Events may arrive out of order.
Example:
purchase event arrives before add_to_cart
Timestamps
Store both:
event_time = when the event happened on the client (the timestamp field)
received_at = when the server received it (the receivedAt field)
Late Event Handling
Use:
- Watermarks
- Allowed lateness window
- Reprocessing jobs
- Backfill correction
👉 Interview Answer
Event ordering is not guaranteed, especially with mobile clients.
I would store both client event time and server received time.
For analytics, stream processors can use watermarks and allowed lateness windows to handle delayed events.
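The watermark idea can be shown with a simplified sketch. It assumes event times are epoch seconds and uses a single global watermark equal to the largest event time seen minus a fixed delay. Real stream engines track this per partition and per window, so treat `WatermarkTracker` and its defaults as illustrative only.

```python
from dataclasses import dataclass


@dataclass
class WatermarkTracker:
    """Tracks a watermark as max observed event time minus a fixed delay."""
    max_out_of_orderness: float = 60.0   # assumed tolerated disorder, seconds
    allowed_lateness: float = 3600.0     # assumed lateness window, seconds
    max_event_time: float = float("-inf")

    @property
    def watermark(self) -> float:
        return self.max_event_time - self.max_out_of_orderness

    def classify(self, event_time: float) -> str:
        """Classify an event as 'on_time', 'late', or 'too_late'."""
        watermark = self.watermark  # watermark before this event is observed
        self.max_event_time = max(self.max_event_time, event_time)
        if event_time >= watermark:
            return "on_time"
        if event_time >= watermark - self.allowed_lateness:
            return "late"      # still merged; the earlier result gets updated
        return "too_late"      # left for a reprocessing or backfill job
```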
🔟 Sessionization
What Is a Session?
A group of user events within an activity window.
Example rule:
new session if user inactive for 30 minutes
Sessionization Flow
User events
→ Group by user / anonymousId
→ Sort by event time
→ Split by inactivity gap
→ Assign sessionId
Uses
- Session length
- Funnel analysis
- Conversion analysis
- Retention metrics
- Attribution
👉 Interview Answer
Sessionization groups events into user sessions.
A common rule is to start a new session after 30 minutes of inactivity.
This is useful for funnels, retention, attribution, and user behavior analysis.
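The sessionization flow above maps directly onto a small batch function. This sketch assumes each event carries an `eventTime` in epoch seconds and either a `userId` or an `anonymousId`; the `sessionId` format is made up for the example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # 30 minutes of inactivity starts a new session


def sessionize(events: list[dict]) -> list[dict]:
    """Return copies of the events with a sessionId assigned per user."""
    by_user: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        # Fall back to anonymousId when the user is not logged in.
        by_user[event.get("userId") or event["anonymousId"]].append(event)

    result = []
    for user, user_events in by_user.items():
        user_events.sort(key=lambda e: e["eventTime"])
        session_index = 0
        previous_time = None
        for event in user_events:
            gap_exceeded = (
                previous_time is not None
                and event["eventTime"] - previous_time > SESSION_GAP_SECONDS
            )
            if gap_exceeded:
                session_index += 1
            previous_time = event["eventTime"]
            result.append({**event, "sessionId": f"{user}-{session_index}"})
    return result
```

Sorting by event time, not arrival time, is what makes the result stable when mobile events show up late.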
1️⃣1️⃣ Stream Processing
Real-time Use Cases
- Live dashboards
- Real-time alerts
- Experiment metrics
- Fraud detection
- Real-time personalization
- Operational monitoring
Flow
Kafka topic
→ Stream processor
→ Aggregate by time window
→ Write to real-time analytics store
Examples
page_views_per_minute
checkout_conversion_rate
purchase_count_by_region
👉 Interview Answer
Stream processing powers near-real-time analytics.
It consumes events from the durable log, performs deduplication and aggregation, and writes metrics to a real-time analytics store.
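As an illustration of the dedupe-then-aggregate step, here is a minimal sketch of `page_views_per_minute` over one-minute tumbling windows. It processes an in-memory list instead of a live stream, and assumes `eventTime` is in epoch seconds; a real job would keep the seen-ID set and counters in the stream processor's state.

```python
from collections import Counter

WINDOW_SECONDS = 60


def page_views_per_minute(events: list[dict]) -> dict[int, int]:
    """Count deduplicated page_view events per one-minute tumbling window."""
    seen_ids: set[str] = set()
    counts: Counter[int] = Counter()
    for event in events:
        if event["eventName"] != "page_view" or event["eventId"] in seen_ids:
            continue
        seen_ids.add(event["eventId"])
        # Window key is the window's start time in epoch seconds.
        window_start = int(event["eventTime"] // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)
```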
1️⃣2️⃣ Batch Processing
Batch Use Cases
- Daily reporting
- Long-term analytics
- Attribution
- Model training
- Data quality checks
- Backfills
Flow
Raw events in object storage
→ Batch job
→ Clean and transform
→ Write analytics tables
→ Serve BI / dashboards / ML training
Why Keep Raw Events?
- Replay pipeline
- Fix bugs
- Backfill new metrics
- Audit data
- Train models
👉 Interview Answer
Raw events should be stored durably, usually in object storage.
This allows replay, backfill, data correction, and offline analytics.
Batch processing creates reliable analytics tables for reporting.
1️⃣3️⃣ Storage Design
Raw Event Storage
Use:
S3 / object storage
Partition by:
date / hour / eventName / source
Real-time Analytics Store
Use:
ClickHouse / Druid / Pinot / Elasticsearch
Good for:
- Fast aggregates
- Dashboards
- Time-range queries
Data Warehouse
Use:
BigQuery / Snowflake / Hive / Redshift
Good for:
- Historical analytics
- Ad hoc SQL
- Model training
👉 Interview Answer
I would use multiple storage systems.
Raw events go to object storage for durability and replay.
Real-time aggregates go to a low-latency analytics store.
Curated historical tables go to a data warehouse for reporting and analysis.
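The raw-storage partitioning scheme can be expressed as a key-building function. This is a sketch under assumptions: the `raw-events` prefix, the `key=value` directory style, a `source` field on the event, and partitioning by server `receivedAt` (which keeps partitions append-only even when client clocks are wrong) are choices made for the example.

```python
from datetime import datetime, timezone


def partition_path(event: dict, prefix: str = "raw-events") -> str:
    """Build an object-store key prefix from receivedAt, eventName, and source."""
    # Accept a trailing "Z" as UTC so older Python versions can parse it.
    received = datetime.fromisoformat(event["receivedAt"].replace("Z", "+00:00"))
    received = received.astimezone(timezone.utc)
    return (
        f"{prefix}/date={received:%Y-%m-%d}/hour={received:%H}"
        f"/event={event['eventName']}/source={event['source']}"
    )
```

With this layout, a query for one event type over one day only has to read the matching date and event directories.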
1️⃣4️⃣ Analytics Use Cases
Funnels
Example:
view_product
→ add_to_cart
→ checkout_started
→ purchase_completed
Retention
Example:
Day 1 / Day 7 / Day 30 retention
Attribution
Example:
Which campaign caused purchase?
Experimentation
Example:
Variant A vs Variant B conversion rate
👉 Interview Answer
Event tracking powers funnels, retention, attribution, experimentation, personalization, and product analytics.
The data model should make it easy to join user events, sessions, experiments, and business outcomes.
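To make the funnel example concrete, here is a small sketch that counts how many users reached each step in order. It assumes events carry `userId`, `eventName`, and an `eventTime` in epoch seconds, and it ignores session or time-window limits that a production funnel definition would usually add.

```python
from typing import Sequence

FUNNEL_STEPS = ("view_product", "add_to_cart", "checkout_started", "purchase_completed")


def funnel_counts(events: list[dict], steps: Sequence[str] = FUNNEL_STEPS) -> list[int]:
    """Count users who reached each step, requiring the steps to occur in order."""
    progress: dict[str, int] = {}  # user -> number of steps completed so far
    for event in sorted(events, key=lambda e: e["eventTime"]):
        user = event["userId"]
        done = progress.get(user, 0)
        # Only the next expected step advances the user; anything else is ignored.
        if done < len(steps) and event["eventName"] == steps[done]:
            progress[user] = done + 1
    return [sum(1 for done in progress.values() if done > i) for i in range(len(steps))]
```

Dividing each count by the previous one gives the step-to-step conversion rate.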
1️⃣5️⃣ Privacy and Compliance
Sensitive Data Risks
Events may contain:
- User IDs
- IP addresses
- Device IDs
- Location
- Payment-related metadata
- Health or financial signals
Requirements
- Consent management
- Data minimization
- PII redaction
- Encryption
- Access control
- Audit logs
- User deletion support
- Retention policy
Important Rule
Do not allow arbitrary PII in event properties.
👉 Interview Answer
Event tracking can easily collect sensitive data.
I would enforce consent checks, data minimization, PII redaction, encryption, access control, retention limits, and deletion support.
Event schemas should prevent arbitrary PII from entering properties.
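One way to keep arbitrary PII out of properties is an allowlist plus redaction at ingestion. The sketch below is illustrative only: the `ALLOWED_PROPERTIES` set, the simple email pattern, and the salted IP hash are assumptions, and a salted hash is pseudonymization, not anonymization, so it does not remove compliance obligations by itself.

```python
import hashlib
import re

# Hypothetical allowlist: only these property keys may pass through.
ALLOWED_PROPERTIES = {"productId", "category", "surface"}
# Deliberately simple pattern; real PII detection needs more than this.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def scrub_event(event: dict, salt: str) -> dict:
    """Drop unlisted properties, mask email-like values, and hash the IP."""
    clean = dict(event)
    props = {}
    for key, value in event.get("properties", {}).items():
        if key not in ALLOWED_PROPERTIES:
            continue  # data minimization: unknown keys never enter storage
        if isinstance(value, str) and EMAIL_PATTERN.search(value):
            value = "[REDACTED]"  # PII redaction for free-text values
        props[key] = value
    clean["properties"] = props

    context = dict(event.get("context", {}))
    if "ip" in context:
        digest = hashlib.sha256((salt + context["ip"]).encode()).hexdigest()
        context["ip"] = digest[:16]  # pseudonymized token instead of the raw IP
    clean["context"] = context
    return clean
```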
1️⃣6️⃣ Data Quality
Common Issues
- Missing required fields
- Wrong property type
- Duplicate events
- Late events
- Inconsistent event names
- Client clock skew
- Bot traffic
- Version mismatch
Strategies
- Schema validation
- Data contracts
- Event versioning
- Dead-letter queue
- Data quality dashboards
- Bot filtering
- Sampling audits
- Alert on metric anomalies
👉 Interview Answer
Data quality is one of the hardest parts of event tracking.
I would enforce schemas, version events, route invalid events to a dead-letter queue, and monitor data quality metrics such as missing fields, duplicates, late events, and schema violations.
1️⃣7️⃣ Scaling Patterns
Pattern 1: Batch Events at Client
Reduce network overhead.
Pattern 2: Durable Event Log
Kafka/Pulsar/Kinesis absorbs spikes and enables replay.
Pattern 3: Separate Hot and Cold Paths
- Hot path = real-time metrics
- Cold path = batch analytics
Pattern 4: Partition by Time and Event Type
Improves storage and query efficiency.
Pattern 5: Schema Registry
Prevents pipeline-breaking events.
👉 Interview Answer
To scale event tracking, I would batch events at the client, write events to a durable log, separate real-time and batch processing, partition raw storage by time and event type, and enforce schemas with a registry.
1️⃣8️⃣ Failure Handling
Common Failures
- Client offline
- Duplicate event send
- Collector overload
- Kafka lag
- Stream processor failure
- Bad event schema
- Data warehouse load failure
- Late event backfill
Strategies
- Client buffering
- Retry with backoff
- Rate limiting
- Durable queue
- Dead-letter queue
- Replay from raw events
- Backfill jobs
- Idempotent processing
👉 Interview Answer
The system should assume failures and duplicates.
Clients buffer and retry events. Collectors write to a durable log. Invalid events go to a dead-letter queue. Raw events are stored durably so downstream pipelines can be replayed or backfilled.
1️⃣9️⃣ Consistency Model
Stronger Consistency Needed For
- Consent and privacy enforcement
- User deletion requests
- Billing-related events
- Experiment assignment logs
- Critical conversion events
Eventual Consistency Acceptable For
- Dashboards
- Funnels
- Retention reports
- Attribution reports
- Personalization features
- Aggregated analytics
👉 Interview Answer
Most analytics can be eventually consistent.
A dashboard being delayed by a few minutes is usually acceptable.
But consent, deletion requests, experiment assignment, and critical billing or conversion events require stronger correctness.
2️⃣0️⃣ Observability
Key Metrics
- Event ingestion QPS
- Collector latency
- Collector error rate
- Queue lag
- Duplicate event rate
- Invalid event rate
- Late event rate
- Stream processing lag
- Raw storage write failure
- Warehouse load latency
- Data freshness
👉 Interview Answer
I would monitor ingestion QPS, collector latency, queue lag, invalid event rate, duplicate rate, late event rate, stream processing lag, warehouse load status, and data freshness.
These metrics show whether the analytics pipeline is healthy and trustworthy.
2️⃣1️⃣ End-to-End Flow
Event Ingestion Flow
User clicks button
→ SDK creates event with eventId
→ SDK buffers and batches events
→ Collector validates schema
→ Event written to Kafka
→ Raw event stored in object storage
Real-time Analytics Flow
Kafka event stream
→ Stream processor deduplicates
→ Aggregate by time window
→ Write real-time metrics
→ Dashboard updates
Batch Analytics Flow
Raw events in object storage
→ Batch job cleans and transforms
→ Sessionization / attribution / funnels
→ Write analytics tables
→ Reports and ML training
Key Insight
An event tracking system is not just click logging; it is the data foundation for analytics, experiments, attribution, and product decisions.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing an event tracking system, I think of it as a high-throughput analytics data pipeline.
Events are generated by web clients, mobile clients, and backend services.
SDKs should generate event IDs, attach context, buffer events, batch uploads, retry failures, and respect user consent.
The ingestion service validates events, enriches them with received timestamp and context, then writes them to a durable event log such as Kafka.
Raw events should be stored durably in object storage so the system can replay, backfill, audit, and rebuild downstream tables.
Stream processing powers near-real-time dashboards, alerts, experiments, and personalization.
Batch processing powers reliable historical analytics, attribution, sessionization, retention, and model training.
Data quality is critical. I would use a schema registry, event versioning, dead-letter queues, deduplication, and data quality dashboards.
Because events may be duplicated or arrive late, every event should have a unique event ID, and the system should store both event time and received time.
Privacy is also critical. The system must enforce consent, data minimization, PII redaction, encryption, access control, retention, and deletion policies.
The main trade-offs are accuracy, latency, cost, privacy, and processing complexity.
Ultimately, the goal is to provide reliable, trusted, and scalable behavioral data for analytics, experimentation, attribution, personalization, and business decisions.
⭐ Final Insight
The core of an event tracking system is not simply recording clicks; it is the data infrastructure that supports analytics, A/B testing, attribution, personalization, and business decisions.