d&d-t System Design Deep Dive

🎯 Design Event Tracking System

1️⃣ Core Framework

When discussing Event Tracking System design, I frame it as:

Event generation and SDK design
Event schema and validation
Ingestion pipeline
Stream processing and batch processing
Storage for raw events and analytics
Deduplication, ordering, and sessionization
Privacy, consent, and compliance
Trade-offs: accuracy vs latency vs cost

2️⃣ Core Requirements

Functional Requirements

Track user events:
- page view
- click
- impression
- purchase
- signup
- search
Support web, mobile, and backend events
Support event schema validation
Support real-time analytics
Support batch analytics
Support dashboards and funnels
Support attribution
Support data export

Non-functional Requirements

High write throughput
Low ingestion latency
Durable raw event storage
Scalable processing pipeline
Exactly-once effect where possible
Eventual consistency acceptable for analytics
Strong privacy and access control
Cost-efficient long-term storage

👉 Interview Answer

An event tracking system collects user and system events, validates them, stores raw events durably, processes them into analytics tables, and powers dashboards, funnels, attribution, and experimentation.

The main challenges are high write throughput, data quality, deduplication, privacy, and cost control.

3️⃣ Main APIs

Track Event

POST /api/events

Request:

{
  "eventId": "evt_123",
  "userId": "u456",
  "anonymousId": "anon_789",
  "eventName": "product_clicked",
  "timestamp": "2026-05-03T10:00:00Z",
  "properties": {
    "productId": "p123",
    "category": "shoes",
    "surface": "home"
  },
  "context": {
    "device": "mobile",
    "appVersion": "1.2.3",
    "ip": "1.2.3.4"
  }
}

Batch Track Events

POST /api/events/batch

Query Analytics

GET /api/analytics/funnel?name=checkout

Export Events

POST /api/events/export

👉 Interview Answer

The most important API is the track event API.

Clients should also support batch uploads, because mobile and web SDKs may buffer events and send them in batches to reduce network overhead.

Analytics queries should be served from processed tables, not raw ingestion APIs.
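
As a rough illustration of what an SDK might send to the two tracking endpoints, the sketch below builds one event in the shape of the Track Event request and splits a buffer into batch bodies. The `{"events": [...]}` wrapper, the batch size of 50, and the helper names `build_event` and `chunk_batches` are assumptions for illustration; the notes above do not specify the batch request format.

```python
import uuid
from datetime import datetime, timezone


def build_event(event_name, user_id, anonymous_id, properties, context):
    """Build one event body in the shape of the Track Event request."""
    return {
        # Generated once on the client and reused on every retry of this event.
        "eventId": f"evt_{uuid.uuid4().hex}",
        "userId": user_id,
        "anonymousId": anonymous_id,
        "eventName": event_name,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "properties": properties,
        "context": context,
    }


def chunk_batches(events, max_batch_size=50):
    """Split buffered events into bodies for POST /api/events/batch.

    The {"events": [...]} wrapper is a hypothetical request shape.
    """
    return [
        {"events": events[i:i + max_batch_size]}
        for i in range(0, len(events), max_batch_size)
    ]


events = [
    build_event("product_clicked", "u456", "anon_789",
                {"productId": f"p{i}", "surface": "home"},
                {"device": "mobile", "appVersion": "1.2.3"})
    for i in range(120)
]
batches = chunk_batches(events)
print([len(b["events"]) for b in batches])  # [50, 50, 20]
```

Generating `eventId` on the client rather than on the server is what lets later stages deduplicate retried uploads.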

4️⃣ Event Schema

Common Event Fields

{
  "eventId": "unique event id",
  "eventName": "product_clicked",
  "userId": "known user id",
  "anonymousId": "anonymous device/session id",
  "timestamp": "client event time",
  "receivedAt": "server received time",
  "properties": {},
  "context": {}
}

Why Schema Matters

Prevent bad data
Support analytics consistency
Improve query reliability
Enable governance
Reduce downstream pipeline failures

Schema Registry

Use a schema registry to define:

Required fields
Allowed event names
Property types
Versioning
Deprecation rules

👉 Interview Answer

Event schema is critical for data quality.

I would use a schema registry to define allowed event names, required fields, property types, and schema versions.

Bad events should be rejected, quarantined, or routed to a dead-letter queue.
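
A minimal sketch of registry-driven validation, assuming an in-memory registry keyed by event name and version. The `schemaVersion` field, the registry layout, and the `validate` and `route` helpers are hypothetical; a production system would more likely use a dedicated schema registry service and a real dead-letter queue.

```python
# Hypothetical in-memory registry: (eventName, version) -> property rules.
REGISTRY = {
    ("product_clicked", 1): {
        "required": ["productId", "surface"],
        "types": {"productId": str, "category": str, "surface": str},
    },
}
COMMON_REQUIRED = ("eventId", "eventName", "timestamp")


def validate(event, registry=REGISTRY):
    """Return a list of violations; an empty list means the event is valid."""
    errors = [f"missing field: {name}" for name in COMMON_REQUIRED if name not in event]
    key = (event.get("eventName"), event.get("schemaVersion", 1))
    schema = registry.get(key)
    if schema is None:
        errors.append(f"unknown event name or version: {key}")
        return errors
    props = event.get("properties", {})
    for name in schema["required"]:
        if name not in props:
            errors.append(f"missing property: {name}")
    for name, value in props.items():
        expected = schema["types"].get(name)
        if expected is None:
            # Undeclared properties are rejected, which also keeps stray PII out.
            errors.append(f"unexpected property: {name}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for property: {name}")
    return errors


def route(event, accepted, dead_letter):
    """Append valid events to `accepted`, invalid ones to `dead_letter`."""
    errors = validate(event)
    if errors:
        dead_letter.append({"event": event, "errors": errors})
    else:
        accepted.append(event)


accepted, dead_letter = [], []
route({"eventId": "evt_123", "eventName": "product_clicked",
       "timestamp": "2026-05-03T10:00:00Z",
       "properties": {"productId": "p123", "category": "shoes", "surface": "home"}},
      accepted, dead_letter)
route({"eventName": "product_clicked", "timestamp": "2026-05-03T10:00:00Z",
       "properties": {"productId": 123}},
      accepted, dead_letter)
print(len(accepted), len(dead_letter))  # 1 1
```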

5️⃣ High-Level Architecture

Client SDK / Backend Service
→ Event Collector
→ Validation / Enrichment
→ Message Queue / Event Log
→ Stream Processing
→ Raw Event Storage
→ Analytics Warehouse
→ Dashboard / Experiment / Attribution

Main Components

Client SDK

Generates event ID
Buffers events
Retries failed sends
Handles batching

Event Collector

Receives events
Authenticates clients
Rate limits traffic
Validates payloads

Stream Processor

Deduplicates events
Enriches events
Computes real-time metrics
Writes to downstream stores

Data Warehouse

Stores processed analytics data
Supports dashboards and ad hoc queries

👉 Interview Answer

I would design event tracking as a high-throughput data pipeline.

SDKs collect and batch events, collectors validate and enrich events, queues provide durability, stream processors compute real-time outputs, and raw events are stored for replay and offline analytics.

6️⃣ Client SDK Design

Responsibilities

Generate event IDs
Attach user/session/device context
Buffer events locally
Batch send events
Retry with backoff
Respect user consent
Drop events when storage is full

Mobile Considerations

App may go offline
Battery usage matters
Network usage matters
Events may arrive late

Web Considerations

Page may close before events are sent
Use the Beacon API (navigator.sendBeacon) when possible
Avoid blocking page navigation

👉 Interview Answer

Client SDK design is important because events are generated at the edge.

The SDK should buffer events, batch uploads, retry failures, attach context, and respect user consent.

It should never block the user experience.
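
The SDK responsibilities above can be sketched as a small in-memory buffer. This is an illustration under stated assumptions, not a real SDK: the `EventBuffer` class, the drop-oldest policy when the buffer is full, the full-jitter backoff, and the injected `send` callable (which returns `True` on success) are all hypothetical choices.

```python
import random
import time
from collections import deque


class EventBuffer:
    """Buffers events, sends them in batches, and retries with backoff."""

    def __init__(self, send, max_buffered=1000, batch_size=20,
                 max_attempts=3, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
        self._send = send                          # callable(list_of_events) -> bool
        self._queue = deque(maxlen=max_buffered)   # a full deque drops its oldest item
        self._batch_size = batch_size
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep

    def track(self, event, consent_granted=True):
        """Queue an event; events without consent are never buffered."""
        if not consent_granted:
            return False
        self._queue.append(event)
        return True

    def flush(self):
        """Send buffered events batch by batch; return how many were sent."""
        sent = 0
        while self._queue:
            size = min(self._batch_size, len(self._queue))
            batch = [self._queue[i] for i in range(size)]
            if not self._send_with_retry(batch):
                break  # keep the remaining events for the next flush
            for _ in batch:
                self._queue.popleft()
            sent += len(batch)
        return sent

    def _send_with_retry(self, batch):
        for attempt in range(self._max_attempts):
            if self._send(batch):
                return True
            if attempt < self._max_attempts - 1:
                delay = min(self._max_delay, self._base_delay * 2 ** attempt)
                self._sleep(random.uniform(0, delay))  # exponential backoff, full jitter
        return False
```

In a real client, `flush` would run on a background timer or thread so that tracking calls never block the UI. Because a failed batch stays in the buffer and is sent again later, the server side has to deduplicate on `eventId`.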

7️⃣ Ingestion Pipeline

Basic Flow

Event received
→ Authenticate source
→ Validate schema
→ Add received timestamp
→ Enrich context
→ Write to durable event log
→ Return success

Event Collector Responsibilities

Authentication
Rate limiting
Schema validation
Payload size validation
Timestamp normalization
IP / geo enrichment
User-agent parsing
Consent filtering

Durable Queue

Common choice:

Kafka / Pulsar / Kinesis

Why?

High throughput
Durable buffering
Replayability
Backpressure handling

👉 Interview Answer

The ingestion service should do lightweight validation and enrichment, then write events to a durable log such as Kafka.

Once the event is durably written, the collector can acknowledge success.

Heavy processing should happen asynchronously.
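
The basic flow can be sketched as a single handler function. Everything named here is a stand-in: `InMemoryLog` takes the place of a durable log such as Kafka, and `API_KEYS`, `MAX_PAYLOAD_BYTES`, the status codes, and the added `source` field are hypothetical. Rate limiting, geo enrichment, and consent filtering are left out to keep the sketch short.

```python
import json
from datetime import datetime, timezone

API_KEYS = {"key_web": "web", "key_ios": "ios"}  # hypothetical API key -> source
MAX_PAYLOAD_BYTES = 32 * 1024                      # hypothetical size limit
REQUIRED_FIELDS = ("eventId", "eventName", "timestamp")


class InMemoryLog:
    """Stand-in for a durable event log; append() returns the record offset."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        self.records.append((key, value))
        return len(self.records) - 1


def handle_event(api_key, raw_body, log, now=None):
    """Authenticate, validate, stamp receivedAt, write to the log, then ack."""
    source = API_KEYS.get(api_key)
    if source is None:
        return 401, {"error": "unknown api key"}
    if len(raw_body) > MAX_PAYLOAD_BYTES:
        return 413, {"error": "payload too large"}
    try:
        event = json.loads(raw_body)
    except ValueError:
        return 400, {"error": "invalid JSON"}
    if not isinstance(event, dict):
        return 400, {"error": "event must be a JSON object"}
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        return 400, {"error": f"missing fields: {missing}"}

    received = now or datetime.now(timezone.utc)
    event["receivedAt"] = received.strftime("%Y-%m-%dT%H:%M:%SZ")
    event["source"] = source
    # Keying by user keeps one user's events in the same partition of the log.
    key = event.get("userId") or event.get("anonymousId") or event["eventId"]
    offset = log.append(key, event)
    # Success is acknowledged only after the write to the log has returned.
    return 202, {"eventId": event["eventId"], "offset": offset}
```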

8️⃣ Deduplication and Idempotency

Why Needed?

Duplicate events can happen due to:

Client retries
Network timeout
SDK resend
Collector retry
Stream processor retry

Deduplication Key

Use:

eventId

or:

source + eventId

Dedup Strategy

Short-term dedupe cache
Persistent dedupe table for critical events
Idempotent downstream writes
Exactly-once stream processing where supported

👉 Interview Answer

Event systems should assume duplicates will happen.

Every event should include a unique event ID.

The pipeline can use dedupe caches, idempotent writes, and event IDs to avoid double-counting important events.
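
A minimal sketch of the short-term dedupe cache, keyed on `source + eventId`. The `DedupeCache` class, the TTL, and the size bound are assumptions. The sketch also assumes `now` never goes backwards, so insertion order matches time order. A shared deployment would typically put this state in an external store rather than process memory.

```python
from collections import OrderedDict


class DedupeCache:
    """Remembers (source, eventId) keys for a limited time window."""

    def __init__(self, ttl_seconds=3600.0, max_entries=1_000_000):
        self._ttl = ttl_seconds
        self._max_entries = max_entries
        self._seen = OrderedDict()  # key -> first-seen time, oldest first

    def is_duplicate(self, source, event_id, now):
        """Return True if this key was already seen within the TTL."""
        self._expire(now)
        key = (source, event_id)
        if key in self._seen:
            return True
        self._seen[key] = now
        if len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return False

    def _expire(self, now):
        # Entries are in insertion order, so expired ones are at the front.
        while self._seen:
            key, seen_at = next(iter(self._seen.items()))
            if now - seen_at <= self._ttl:
                break
            del self._seen[key]


cache = DedupeCache(ttl_seconds=10)
print(cache.is_duplicate("web", "evt_123", now=0))   # False: first sighting
print(cache.is_duplicate("web", "evt_123", now=5))   # True: retry inside the TTL
print(cache.is_duplicate("web", "evt_123", now=20))  # False: the entry has expired
```

The last call shows the limit of a short-term cache: a duplicate that arrives after the TTL is not caught, which is why the notes pair it with a persistent dedupe table for critical events and idempotent downstream writes.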

9️⃣ Ordering and Late Events

Ordering Problem

Events may arrive out of order.

Example:

purchase event arrives before add_to_cart

Timestamps

Store both:

event_time = when the event happened on the client (the "timestamp" field in the schema)
received_at = when the server received it (the "receivedAt" field in the schema)

Late Event Handling

Use:

Watermarks
Allowed lateness window
Reprocessing jobs
Backfill correction

👉 Interview Answer

Event ordering is not guaranteed, especially with mobile clients.

I would store both client event time and server received time.

For analytics, stream processors can use watermarks and allowed lateness windows to handle delayed events.
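
To make watermarks and allowed lateness concrete, the sketch below classifies events against fixed-size event-time windows. It follows the general idea used by stream processors: the watermark is the largest event time seen minus a bound on out-of-orderness. The `LatenessTracker` class, its parameters, and the three labels are hypothetical. Times are plain integer seconds.

```python
class LatenessTracker:
    """Classifies events as on time, late but accepted, or too late."""

    def __init__(self, max_out_of_orderness, allowed_lateness, window_size=60):
        self.max_out_of_orderness = max_out_of_orderness
        self.allowed_lateness = allowed_lateness
        self.window_size = window_size
        self.max_event_time = None

    @property
    def watermark(self):
        """Event time before which the stream is assumed to be complete."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_out_of_orderness

    def classify(self, event_time):
        # End of the fixed window [start, start + window_size) holding the event.
        window_end = (event_time // self.window_size + 1) * self.window_size
        watermark = self.watermark  # judge against the watermark before this event
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        if watermark is None or window_end > watermark:
            return "on_time"        # the window has not been finalized yet
        if window_end + self.allowed_lateness > watermark:
            return "late_accepted"  # update the already-emitted window result
        return "too_late"           # leave it to a reprocessing or backfill job


tracker = LatenessTracker(max_out_of_orderness=10, allowed_lateness=120)
for t in (100, 200, 95, 400, 95):
    print(t, tracker.classify(t))
```

With these settings the five events come out as `on_time`, `on_time`, `late_accepted`, `on_time`, and `too_late`. The same event time (95) is accepted while its window is still within the allowed lateness and rejected once the watermark has moved far enough ahead.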

🔟 Sessionization

What Is a Session?

A group of user events within an activity window.

Example rule:

new session if user inactive for 30 minutes

Sessionization Flow

User events
→ Group by user / anonymousId
→ Sort by event time
→ Split by inactivity gap
→ Assign sessionId

Uses

Session length
Funnel analysis
Conversion analysis
Retention metrics
Attribution

👉 Interview Answer

Sessionization groups events into user sessions.

A common rule is to start a new session after 30 minutes of inactivity.

This is useful for funnels, retention, attribution, and user behavior analysis.
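
The sessionization flow can be sketched as a batch function. Assumptions: timestamps use the `YYYY-MM-DDTHH:MM:SSZ` format from the API example, a gap of exactly 30 minutes already starts a new session, and the `sessionId` format (`<user key>-<index>`) is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def parse_ts(value):
    """Parse the UTC timestamp format used in the API example."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def sessionize(events, gap=SESSION_GAP):
    """Return copies of the events with a sessionId attached."""
    by_user = defaultdict(list)
    for event in events:
        # Group by known user when available, otherwise by anonymousId.
        by_user[event.get("userId") or event["anonymousId"]].append(event)

    result = []
    for user_key, user_events in by_user.items():
        user_events.sort(key=lambda e: parse_ts(e["timestamp"]))
        session_index = 0
        previous = None
        for event in user_events:
            current = parse_ts(event["timestamp"])
            if previous is not None and current - previous >= gap:
                session_index += 1  # inactivity gap reached: start a new session
            previous = current
            result.append({**event, "sessionId": f"{user_key}-{session_index}"})
    return result


sample = [
    {"userId": "u1", "timestamp": "2026-05-03T10:45:00Z", "eventName": "purchase"},
    {"userId": "u1", "timestamp": "2026-05-03T10:00:00Z", "eventName": "page_view"},
    {"userId": "u1", "timestamp": "2026-05-03T10:10:00Z", "eventName": "click"},
    {"anonymousId": "anon_789", "timestamp": "2026-05-03T10:05:00Z", "eventName": "page_view"},
]
for row in sessionize(sample):
    print(row["sessionId"], row["eventName"])
```

Sorting by client event time means late or out-of-order uploads still land in the right session, which is one reason sessionization is usually done in the batch path.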

1️⃣1️⃣ Stream Processing

Real-time Use Cases

Live dashboards
Real-time alerts
Experiment metrics
Fraud detection
Real-time personalization
Operational monitoring

Flow

Kafka topic
→ Stream processor
→ Aggregate by time window
→ Write to real-time analytics store

Examples

page_views_per_minute
checkout_conversion_rate
purchase_count_by_region

👉 Interview Answer

Stream processing powers near-real-time analytics.

It consumes events from the durable log, performs deduplication and aggregation, and writes metrics to a real-time analytics store.
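
As a rough illustration of "deduplicate, then aggregate by time window", the sketch below computes `page_views_per_minute` over a list of events. It is a plain in-memory function rather than a real stream job. It assumes timestamps in the fixed UTC format from the API example, so the first 16 characters identify the minute, and it uses an unbounded `seen` set where a real job would use bounded, checkpointed state.

```python
from collections import Counter


def page_views_per_minute(events):
    """Count distinct page_view events per one-minute tumbling window."""
    seen = set()
    counts = Counter()
    for event in events:
        if event["eventName"] != "page_view":
            continue
        if event["eventId"] in seen:
            continue  # duplicate delivery: count each eventId once
        seen.add(event["eventId"])
        minute = event["timestamp"][:16]  # "YYYY-MM-DDTHH:MM"
        counts[minute] += 1
    return dict(counts)


stream = [
    {"eventId": "e1", "eventName": "page_view", "timestamp": "2026-05-03T10:00:05Z"},
    {"eventId": "e2", "eventName": "page_view", "timestamp": "2026-05-03T10:00:40Z"},
    {"eventId": "e1", "eventName": "page_view", "timestamp": "2026-05-03T10:00:05Z"},
    {"eventId": "e3", "eventName": "click", "timestamp": "2026-05-03T10:00:50Z"},
    {"eventId": "e4", "eventName": "page_view", "timestamp": "2026-05-03T10:01:10Z"},
]
print(page_views_per_minute(stream))
# {'2026-05-03T10:00': 2, '2026-05-03T10:01': 1}
```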

1️⃣2️⃣ Batch Processing

Batch Use Cases

Daily reporting
Long-term analytics
Attribution
Model training
Data quality checks
Backfills

Flow

Raw events in object storage
→ Batch job
→ Clean and transform
→ Write analytics tables
→ Serve BI / dashboards / ML training

Why Keep Raw Events?

Replay pipeline
Fix bugs
Backfill new metrics
Audit data
Train models

👉 Interview Answer

Raw events should be stored durably, usually in object storage.

This allows replay, backfill, data correction, and offline analytics.

Batch processing creates reliable analytics tables for reporting.

1️⃣3️⃣ Storage Design

Raw Event Storage

Use:

S3 / object storage

Partition by:

date / hour / eventName / source

Real-time Analytics Store

Use:

ClickHouse / Druid / Pinot / Elasticsearch

Good for:

Fast aggregates
Dashboards
Time-range queries

Data Warehouse

Use:

BigQuery / Snowflake / Hive / Redshift

Good for:

Historical analytics
Ad hoc SQL
Model training

👉 Interview Answer

I would use multiple storage systems.

Raw events go to object storage for durability and replay.

Real-time aggregates go to a low-latency analytics store.

Curated historical tables go to a data warehouse for reporting and analysis.
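
The raw-storage partitioning scheme (date / hour / eventName / source) can be sketched as a function that derives an object-storage prefix. The bucket name, the Hive-style `key=value` layout, and partitioning by server `receivedAt` rather than client event time are assumptions for illustration. Partitioning by arrival time keeps old partitions from being reopened by late events.

```python
from datetime import datetime


def raw_event_prefix(event, source, bucket="raw-events"):
    """Return the partition prefix under which a raw event would be written."""
    received = datetime.strptime(event["receivedAt"], "%Y-%m-%dT%H:%M:%SZ")
    return (
        f"s3://{bucket}/"
        f"date={received:%Y-%m-%d}/hour={received:%H}/"
        f"eventName={event['eventName']}/source={source}/"
    )


example = {
    "eventId": "evt_123",
    "eventName": "product_clicked",
    "timestamp": "2026-05-03T09:59:58Z",
    "receivedAt": "2026-05-03T10:00:05Z",
}
print(raw_event_prefix(example, source="web"))
# s3://raw-events/date=2026-05-03/hour=10/eventName=product_clicked/source=web/
```

A query that filters on a date range and one event name then only has to read the matching prefixes.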

1️⃣4️⃣ Analytics Use Cases

Funnels

Example:

view_product
→ add_to_cart
→ checkout_started
→ purchase_completed

Retention

Example:

Day 1 / Day 7 / Day 30 retention

Attribution

Example:

Which campaign caused purchase?

Experimentation

Example:

Variant A vs Variant B conversion rate

👉 Interview Answer

Event tracking powers funnels, retention, attribution, experimentation, personalization, and product analytics.

The data model should make it easy to join user events, sessions, experiments, and business outcomes.
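
A minimal sketch of the funnel example, counting how many users reached each step in order. It assumes every event carries a `userId` and a UTC timestamp in one fixed format, so that string order equals time order. It also uses a simple "strict order, no time limit" definition, which is only one of several reasonable funnel definitions.

```python
from collections import defaultdict

FUNNEL = ["view_product", "add_to_cart", "checkout_started", "purchase_completed"]


def funnel_counts(events, steps=FUNNEL):
    """Return {step: number of users who reached it after all earlier steps}."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["userId"]].append(event)

    reached = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e["timestamp"])
        next_step = 0
        for event in user_events:
            if next_step < len(steps) and event["eventName"] == steps[next_step]:
                reached[next_step] += 1
                next_step += 1
    return dict(zip(steps, reached))


def ev(user, name, minute):
    """Build a tiny test event at 10:<minute> on a fixed day."""
    return {"userId": user, "eventName": name, "timestamp": f"2026-05-03T10:{minute:02d}:00Z"}


history = [
    ev("a", "view_product", 0), ev("a", "add_to_cart", 1),
    ev("a", "checkout_started", 2), ev("a", "purchase_completed", 3),
    ev("b", "view_product", 0), ev("b", "add_to_cart", 5),
    ev("c", "add_to_cart", 0),  # never viewed a product, so not in the funnel
]
print(funnel_counts(history))
```

For this sample the counts are 2, 2, 1, and 1 for the four steps, so the step conversion from `add_to_cart` to `checkout_started` is 50%.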

1️⃣5️⃣ Privacy and Compliance

Sensitive Data Risks

Events may contain:

User IDs
IP addresses
Device IDs
Location
Email
Payment-related metadata
Health or financial signals

Requirements

Consent management
Data minimization
PII redaction
Encryption
Access control
Audit logs
User deletion support
Retention policy

Important Rule

Do not allow arbitrary PII in event properties.

👉 Interview Answer

Event tracking can easily collect sensitive data.

I would enforce consent checks, data minimization, PII redaction, encryption, access control, retention limits, and deletion support.

Event schemas should prevent arbitrary PII from entering properties.
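
One way to keep arbitrary PII out of `properties` is an allowlist per event name plus a redaction pass over string values. The sketch below assumes a hypothetical `ALLOWED_PROPERTIES` table, a simple email pattern, and a policy of dropping the raw IP from `context` before storage. Real redaction rules depend on the applicable regulations and would cover more than email addresses.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical allowlist: only these properties may be stored per event name.
ALLOWED_PROPERTIES = {
    "product_clicked": {"productId", "category", "surface"},
}


def sanitize(event, consent_granted):
    """Return a copy that is safe to store, or None when consent is missing."""
    if not consent_granted:
        return None
    allowed = ALLOWED_PROPERTIES.get(event["eventName"], set())
    properties = {}
    for name, value in event.get("properties", {}).items():
        if name not in allowed:
            continue  # data minimization: drop anything not declared
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted]", value)
        properties[name] = value
    # Assumed policy: the raw IP is dropped once any geo enrichment is done.
    context = {k: v for k, v in event.get("context", {}).items() if k != "ip"}
    return {**event, "properties": properties, "context": context}


raw = {
    "eventId": "evt_123",
    "eventName": "product_clicked",
    "properties": {"productId": "p123", "email": "jane@example.com",
                   "surface": "shared by jane@example.com"},
    "context": {"device": "mobile", "ip": "1.2.3.4"},
}
print(sanitize(raw, consent_granted=True))
```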

1️⃣6️⃣ Data Quality

Common Issues

Missing required fields
Wrong property type
Duplicate events
Late events
Inconsistent event names
Client clock skew
Bot traffic
Version mismatch

Strategies

Schema validation
Data contracts
Event versioning
Dead-letter queue
Data quality dashboards
Bot filtering
Sampling audits
Alert on metric anomalies

👉 Interview Answer

Data quality is one of the hardest parts of event tracking.

I would enforce schemas, version events, route invalid events to a dead-letter queue, and monitor data quality metrics such as missing fields, duplicates, late events, and schema violations.

1️⃣7️⃣ Scaling Patterns

Pattern 1: Batch Events at Client

Reduce network overhead.

Pattern 2: Durable Event Log

Kafka/Pulsar/Kinesis absorbs spikes and enables replay.

Pattern 3: Separate Hot and Cold Paths

Hot path = real-time metrics
Cold path = batch analytics

Pattern 4: Partition by Time and Event Type

Improves storage and query efficiency.

Pattern 5: Schema Registry

Prevents pipeline-breaking events.

👉 Interview Answer

To scale event tracking, I would batch events at the client, write events to a durable log, separate real-time and batch processing, partition raw storage by time and event type, and enforce schemas with a registry.

1️⃣8️⃣ Failure Handling

Common Failures

Client offline
Duplicate event send
Collector overload
Kafka lag
Stream processor failure
Bad event schema
Data warehouse load failure
Late event backfill

Strategies

Client buffering
Retry with backoff
Rate limiting
Durable queue
Dead-letter queue
Replay from raw events
Backfill jobs
Idempotent processing

👉 Interview Answer

The system should assume failures and duplicates.

Clients buffer and retry events. Collectors write to a durable log. Invalid events go to a dead-letter queue. Raw events are stored durably so downstream pipelines can be replayed or backfilled.

1️⃣9️⃣ Consistency Model

Stronger Consistency Needed For

Consent and privacy enforcement
User deletion requests
Billing-related events
Experiment assignment logs
Critical conversion events

Eventual Consistency Acceptable For

Dashboards
Funnels
Retention reports
Attribution reports
Personalization features
Aggregated analytics

👉 Interview Answer

Most analytics can be eventually consistent.

A dashboard being delayed by a few minutes is usually acceptable.

But consent, deletion requests, experiment assignment, and critical billing or conversion events require stronger correctness.

2️⃣0️⃣ Observability

Key Metrics

Event ingestion QPS
Collector latency
Collector error rate
Queue lag
Duplicate event rate
Invalid event rate
Late event rate
Stream processing lag
Raw storage write failure
Warehouse load latency
Data freshness

👉 Interview Answer

I would monitor ingestion QPS, collector latency, queue lag, invalid event rate, duplicate rate, late event rate, stream processing lag, warehouse load status, and data freshness.

These metrics show whether the analytics pipeline is healthy and trustworthy.

2️⃣1️⃣ End-to-End Flow

Event Ingestion Flow

User clicks button
→ SDK creates event with eventId
→ SDK buffers and batches events
→ Collector validates schema
→ Event written to Kafka
→ Raw event stored in object storage

Real-time Analytics Flow

Kafka event stream
→ Stream processor deduplicates
→ Aggregate by time window
→ Write real-time metrics
→ Dashboard updates

Batch Analytics Flow

Raw events in object storage
→ Batch job cleans and transforms
→ Sessionization / attribution / funnels
→ Write analytics tables
→ Reports and ML training

Key Insight

An event tracking system is not just click logging; it is the data foundation for analytics, experiments, attribution, and product decisions.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing an event tracking system, I think of it as a high-throughput analytics data pipeline.

Events are generated by web clients, mobile clients, and backend services.

SDKs should generate event IDs, attach context, buffer events, batch uploads, retry failures, and respect user consent.

The ingestion service validates events, enriches them with received timestamp and context, then writes them to a durable event log such as Kafka.

Raw events should be stored durably in object storage so the system can replay, backfill, audit, and rebuild downstream tables.

Stream processing powers near-real-time dashboards, alerts, experiments, and personalization.

Batch processing powers reliable historical analytics, attribution, sessionization, retention, and model training.

Data quality is critical. I would use a schema registry, event versioning, dead-letter queues, deduplication, and data quality dashboards.

Because events may be duplicated or arrive late, every event should have a unique event ID, and the system should store both event time and received time.

Privacy is also critical. The system must enforce consent, data minimization, PII redaction, encryption, access control, retention, and deletion policies.

The main trade-offs are accuracy, latency, cost, privacy, and processing complexity.

Ultimately, the goal is to provide reliable, trusted, and scalable behavioral data for analytics, experimentation, attribution, personalization, and business decisions.

⭐ Final Insight

The core of an Event Tracking System is not simply recording clicks; it is the data infrastructure that supports analytics, A/B testing, attribution, personalization, and business decisions.

Chinese Section (Translated)

🎯 Design Event Tracking System

1️⃣ Core Framework

When designing an Event Tracking System, I usually analyze it from the following angles:

Event generation and SDK design
Event schema and validation
Ingestion pipeline
Stream processing and batch processing
Raw events and analytics storage
Deduplication, ordering, and sessionization
Privacy, consent, and compliance
Core trade-offs: accuracy vs latency vs cost

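2️⃣ Core Requirements

Functional Requirements

Track user events:
- page view
- click
- impression
- purchase
- signup
- search
Support web, mobile, and backend events
Support event schema validation
Support real-time analytics
Support batch analytics
Support dashboards and funnels
Support attribution
Support data export

Non-functional Requirements

High write throughput
Low ingestion latency
Durable raw event storage
Scalable processing pipeline
Exactly-once effect where possible
Eventual consistency acceptable for analytics
Strong privacy and access control
Controllable long-term storage cost

👉 Interview Answer

An Event Tracking System collects user and system events, validates them, stores raw events durably, processes them into analytics tables, and supports dashboards, funnels, attribution, and experimentation.

The core challenges include high write throughput, data quality, deduplication, privacy, and cost control.

3️⃣ Main APIs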

Track Event

POST /api/events

Request:

{
  "eventId": "evt_123",
  "userId": "u456",
  "anonymousId": "anon_789",
  "eventName": "product_clicked",
  "timestamp": "2026-05-03T10:00:00Z",
  "properties": {
    "productId": "p123",
    "category": "shoes",
    "surface": "home"
  },
  "context": {
    "device": "mobile",
    "appVersion": "1.2.3",
    "ip": "1.2.3.4"
  }
}

Batch Track Events

POST /api/events/batch

Query Analytics

GET /api/analytics/funnel?name=checkout

Export Events

POST /api/events/export

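👉 Interview Answer

The most important API is the track event API.

Clients should also support batch uploads, because mobile and web SDKs may buffer events and send them in batches to reduce network overhead.

Analytics queries should read from processed tables rather than query the ingestion API directly.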

4️⃣ Event Schema

Common Event Fields

{
  "eventId": "unique event id",
  "eventName": "product_clicked",
  "userId": "known user id",
  "anonymousId": "anonymous device/session id",
  "timestamp": "client event time",
  "receivedAt": "server received time",
  "properties": {},
  "context": {}
}

Why Does Schema Matter?

Prevent bad data
Ensure analytics consistency
Improve query reliability
Support data governance
Reduce downstream pipeline failures

Schema Registry

The schema registry defines:

Required fields
Allowed event names
Property types
Versioning
Deprecation rules

👉 Interview Answer

Event schema is critical for data quality.

I would use a schema registry to define allowed event names, required fields, property types, and schema versions.

Bad events should be rejected, quarantined, or sent to a dead-letter queue.

5️⃣ High-Level Architecture

Client SDK / Backend Service
→ Event Collector
→ Validation / Enrichment
→ Message Queue / Event Log
→ Stream Processing
→ Raw Event Storage
→ Analytics Warehouse
→ Dashboard / Experiment / Attribution

Main Components

Client SDK

Generate event ID
Buffer events
Retry failed sends
Support batching

Event Collector

Receive events
Authenticate clients
Rate limit traffic
Validate payloads

Stream Processor

Deduplicate events
Enrich events
Compute real-time metrics
Write to downstream stores

Data Warehouse

Store processed analytics data
Support dashboards and ad hoc queries

👉 Interview Answer

I would design event tracking as a high-throughput data pipeline.

SDKs collect events and send them in batches; collectors validate and enrich events; queues provide durability; stream processors compute real-time outputs; raw events are stored for replay and offline analytics.

6️⃣ Client SDK Design

Responsibilities

Generate event IDs
Attach user / session / device context
Buffer events locally
Batch send events
Retry with backoff
Respect user consent
Drop events when storage is full

Mobile Considerations

App may go offline
Battery usage matters
Network usage matters
Events may arrive late

Web Considerations

Page may close before events are sent
Use beacon API when possible
Avoid blocking page navigation

👉 Interview Answer

Client SDK design matters because events are generated at the edge.

The SDK should buffer events, batch uploads, retry failures, attach context, and respect user consent.

It must never block the user experience.

7️⃣ Ingestion Pipeline

Basic Flow

Event received
→ Authenticate source
→ Validate schema
→ Add received timestamp
→ Enrich context
→ Write to durable event log
→ Return success

Event Collector Responsibilities

Authentication
Rate limiting
Schema validation
Payload size validation
Timestamp normalization
IP / geo enrichment
User-agent parsing
Consent filtering

Durable Queue

Common choices:

Kafka / Pulsar / Kinesis

Reasons:

High throughput
Durable buffering
Replayability
Backpressure handling

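👉 Interview Answer

The ingestion service should do lightweight validation and enrichment, then write events to a durable log such as Kafka.

Once the event is durably written, the collector can return success.

Heavy processing should be done asynchronously.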

8️⃣ Deduplication and Idempotency

Why Is It Needed?

Duplicate events can come from:

Client retries
Network timeout
SDK resend
Collector retry
Stream processor retry

Deduplication Key

Use:

eventId

Or:

source + eventId

Dedup Strategy

Short-term dedupe cache
Persistent dedupe table for critical events
Idempotent downstream writes
Exactly-once stream processing where supported

👉 Interview Answer

An event system should assume that duplicates will happen.

Every event should include a unique event ID.

The pipeline can use a dedupe cache, idempotent writes, and event IDs to keep critical events from being counted twice.

9️⃣ Ordering and Late Events

Ordering Problem

Events may arrive out of order.

Example:

purchase event arrives before add_to_cart

Timestamps

Store both:

event_time = when event happened on client
received_at = when server received it

Late Event Handling

Use:

Watermarks
Allowed lateness window
Reprocessing jobs
Backfill correction

👉 Interview Answer

Event ordering is not guaranteed, especially for mobile clients.

I would store both the client event time and the server received time.

For analytics, stream processors can use watermarks and allowed lateness windows to handle late events.

🔟 Sessionization

What Is a Session?

A session is a group of a user's events within one activity window.

Common rule:

new session if user inactive for 30 minutes

Sessionization Flow

User events
→ Group by user / anonymousId
→ Sort by event time
→ Split by inactivity gap
→ Assign sessionId

Uses

Session length
Funnel analysis
Conversion analysis
Retention metrics
Attribution

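👉 Interview Answer

Sessionization groups events into user sessions.

A common rule is to start a new session after the user has been inactive for 30 minutes.

This is useful for funnels, retention, attribution, and user behavior analysis.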

1️⃣1️⃣ Stream Processing

Real-time Use Cases

Live dashboards
Real-time alerts
Experiment metrics
Fraud detection
Real-time personalization
Operational monitoring

Flow

Kafka topic
→ Stream processor
→ Aggregate by time window
→ Write to real-time analytics store

Examples

page_views_per_minute
checkout_conversion_rate
purchase_count_by_region

👉 Interview Answer

Stream processing supports near-real-time analytics.

It consumes events from the durable log, performs deduplication and aggregation, and writes metrics to the real-time analytics store.

1️⃣2️⃣ Batch Processing

Batch Use Cases

Daily reporting
Long-term analytics
Attribution
Model training
Data quality checks
Backfills

Flow

Raw events in object storage
→ Batch job
→ Clean and transform
→ Write analytics tables
→ BI / dashboard / ML training

Why Keep Raw Events?

Replay pipeline
Fix bugs
Backfill new metrics
Audit data
Train models

👉 Interview Answer

Raw events should be stored durably, usually in object storage.

This enables replay, backfill, data correction, and offline analytics.

Batch processing produces reliable analytics tables for reporting.

1️⃣3️⃣ Storage Design

Raw Event Storage

Use:

S3 / object storage

Partition by:

date / hour / eventName / source

Real-time Analytics Store

Use:

ClickHouse / Druid / Pinot / Elasticsearch

Good for:

Fast aggregates
Dashboards
Time-range queries

Data Warehouse

Use:

BigQuery / Snowflake / Hive / Redshift

Good for:

Historical analytics
Ad hoc SQL
Model training

👉 Interview Answer

I would use multiple storage systems.

Raw events go into object storage for durability and replay.

Real-time aggregates are written to a low-latency analytics store.

Curated historical tables are written to the data warehouse for reporting and analysis.

1️⃣4️⃣ Analytics Use Cases

Funnels

Example:

view_product
→ add_to_cart
→ checkout_started
→ purchase_completed

Retention

Example:

Day 1 / Day 7 / Day 30 retention

Attribution

Example:

Which campaign caused purchase?

Experimentation

Example:

Variant A vs Variant B conversion rate

👉 Interview Answer

Event tracking supports funnels, retention, attribution, experimentation, personalization, and product analytics.

The data model should make it easy to join user events, sessions, experiments, and business outcomes.

1️⃣5️⃣ Privacy and Compliance

Sensitive Data Risks

Events may contain:

User IDs
IP addresses
Device IDs
Location
Email
Payment-related metadata
Health or financial signals

Requirements

Consent management
Data minimization
PII redaction
Encryption
Access control
Audit logs
User deletion support
Retention policy

Important Rule

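Do not allow arbitrary PII to be written into event properties.

👉 Interview Answer

Event tracking can easily collect sensitive data.

I would enforce consent checks, data minimization, PII redaction, encryption, access control, retention limits, and deletion support.

Event schemas should prevent arbitrary PII from entering properties.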

1️⃣6️⃣ Data Quality

Common Issues

Missing required fields
Wrong property type
Duplicate events
Late events
Inconsistent event names
Client clock skew
Bot traffic
Version mismatch

Strategies

Schema validation
Data contracts
Event versioning
Dead-letter queue
Data quality dashboards
Bot filtering
Sampling audits
Alert on metric anomalies

👉 Interview Answer

Data quality is one of the hardest problems in event tracking.

I would enforce schemas, version events, route invalid events to a dead-letter queue, and monitor data quality metrics such as missing fields, duplicates, late events, and schema violations.

1️⃣7️⃣ Scaling Patterns

Pattern 1: Batch Events at Client

Reduces network overhead.

Pattern 2: Durable Event Log

Kafka / Pulsar / Kinesis absorbs spikes and supports replay.

Pattern 3: Separate Hot and Cold Paths

Hot path = real-time metrics
Cold path = batch analytics

Pattern 4: Partition by Time and Event Type

Improves storage and query efficiency.

Pattern 5: Schema Registry

Prevents pipeline-breaking events.

👉 Interview Answer

To scale event tracking, I would batch events at the client, write events to a durable log, separate real-time and batch processing, partition raw storage by time and event type, and enforce schemas with a schema registry.

1️⃣8️⃣ Failure Handling

Common Failures

Client offline
Duplicate event send
Collector overload
Kafka lag
Stream processor failure
Bad event schema
Data warehouse load failure
Late event backfill

Strategies

Client buffering
Retry with backoff
Rate limiting
Durable queue
Dead-letter queue
Replay from raw events
Backfill jobs
Idempotent processing

👉 Interview Answer

The system should assume that failures and duplicates will happen.

Clients buffer and retry events. Collectors write events to a durable log. Invalid events go to a dead-letter queue. Raw events are stored durably, so downstream pipelines can be replayed or backfilled.

1️⃣9️⃣ Consistency Model

Scenarios That Need Stronger Consistency

Consent and privacy enforcement
User deletion requests
Billing-related events
Experiment assignment logs
Critical conversion events

Scenarios Where Eventual Consistency Is Acceptable

Dashboards
Funnels
Retention reports
Attribution reports
Personalization features
Aggregated analytics

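👉 Interview Answer

Most analytics can be eventually consistent.

A dashboard delayed by a few minutes is usually acceptable.

But consent, deletion requests, experiment assignment, and critical billing / conversion events require stronger correctness.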

2️⃣0️⃣ Observability

Key Metrics

Event ingestion QPS
Collector latency
Collector error rate
Queue lag
Duplicate event rate
Invalid event rate
Late event rate
Stream processing lag
Raw storage write failure
Warehouse load latency
Data freshness

👉 Interview Answer

I would monitor ingestion QPS, collector latency, queue lag, invalid event rate, duplicate rate, late event rate, stream processing lag, warehouse load status, and data freshness.

These metrics show whether the analytics pipeline is healthy and trustworthy.

2️⃣1️⃣ End-to-End Flow

Event Ingestion Flow

User clicks button
→ SDK creates event with eventId
→ SDK buffers and batches events
→ Collector validates schema
→ Event written to Kafka
→ Raw event stored in object storage

Real-time Analytics Flow

Kafka event stream
→ Stream processor deduplicates
→ Aggregate by time window
→ Write real-time metrics
→ Dashboard updates

Batch Analytics Flow

Raw events in object storage
→ Batch job cleans and transforms
→ Sessionization / attribution / funnels
→ Write analytics tables
→ Reports and ML training

Key Insight

An Event Tracking System is not simply click recording; it is the data infrastructure for analytics, experiments, attribution, and product decisions.

🧠 Staff-Level Answer (Final Version)

👉 Interview Answer (Full Recitation Version)

When designing an Event Tracking System, I treat it as a high-throughput analytics data pipeline.

Events are generated by web clients, mobile clients, and backend services.

SDKs should generate event IDs, attach context, buffer events, upload in batches, retry failures, and respect user consent.

The ingestion service validates events, adds the received timestamp and context, then writes them to a durable event log such as Kafka.

Raw events should be stored durably in object storage so that the system can replay, backfill, audit, and rebuild downstream tables.

Stream processing supports near-real-time dashboards, alerts, experiments, and personalization.

Batch processing supports reliable historical analytics, attribution, sessionization, retention, and model training.

Data quality is critical. I would use a schema registry, event versioning, dead-letter queues, deduplication, and data quality dashboards.

Because events may be duplicated or arrive late, every event should have a unique event ID, and the system should store both event time and received time.

Privacy is also critical. The system must enforce consent, data minimization, PII redaction, encryption, access control, retention, and deletion policies.

The core trade-offs include accuracy, latency, cost, privacy, and processing complexity.

The ultimate goal is to provide reliable, trustworthy, and scalable user behavior data that supports analytics, experimentation, attribution, personalization, and business decisions.

⭐ Final Insight

The core of an Event Tracking System is not simply recording clicks; it is the data infrastructure that supports analytics, A/B testing, attribution, personalization, and business decisions.