d&d-t System Design Deep Dive

🎯 Design Feature Flag System

1️⃣ Core Framework

When discussing Feature Flag System design, I frame it as:

Feature flag data model
Targeting and rollout rules
Flag evaluation engine
SDK and caching strategy
Control plane and data plane separation
Audit, approval, and governance
Kill switch and incident rollback
Trade-offs: consistency vs latency vs safety

2️⃣ Core Requirements

Functional Requirements

Create, update, delete feature flags
Enable / disable features by environment
Target by user, tenant, region, plan, device, app version
Support percentage rollout
Support A/B experiments
Support kill switch
Support audit logs
Support approval workflow
Support SDK evaluation

Non-functional Requirements

Low-latency flag evaluation
High availability
Safe rollout
Strong auditability
Prevent accidental global rollout
Support real-time or near-real-time propagation
SDK should work even if flag service is unavailable

👉 Interview Answer

A feature flag system controls feature behavior at runtime without redeploying code.

The main challenge is to evaluate flags safely and with low latency, support targeting and gradual rollout, and still maintain auditability, consistency, and fast rollback.

3️⃣ Core Concepts

Feature Flag

A flag controls whether a feature is enabled.

Example:

new_checkout_enabled = true

Environment

Flags usually differ by environment:

dev
qa
staging
production

Targeting Rule

Example:

enable for enterprise tenants in US region

Percentage Rollout

Example:

enable for 10% of users

Kill Switch

A flag used to immediately disable risky behavior.

👉 Interview Answer

I would model feature flags as runtime configuration.

Each flag has environments, targeting rules, rollout percentage, default value, owner, and audit history.

The evaluation engine decides the final value for each request context.

4️⃣ Main APIs

Create Flag

POST /api/flags

Request:

{
  "key": "new_checkout_enabled",
  "description": "Enable new checkout experience",
  "owner": "checkout-team",
  "defaultValue": false
}

Update Flag Rules

PATCH /api/flags/{flagKey}/rules

Evaluate Flag

POST /api/flags/evaluate

Request:

{
  "flagKey": "new_checkout_enabled",
  "context": {
    "userId": "u123",
    "tenantId": "t456",
    "region": "US",
    "plan": "enterprise",
    "appVersion": "2.5.0"
  }
}

Get All Flags for SDK

GET /api/sdk/flags?environment=production

👉 Interview Answer

I would expose management APIs for creating and updating flags, and evaluation APIs for runtime usage.

In production, most evaluation should happen locally inside SDKs using cached flag rules, not by calling the flag service on every request.

5️⃣ Data Model

Feature Flag Table

feature_flag (
  flag_key VARCHAR PRIMARY KEY,
  name VARCHAR,
  description TEXT,
  owner_team VARCHAR,
  default_value JSON,
  flag_type VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

Flag Environment Config Table

flag_environment_config (
  flag_key VARCHAR,
  environment VARCHAR,
  enabled BOOLEAN,
  rules JSON,
  rollout_percentage INT,
  version BIGINT,
  updated_at TIMESTAMP,
  PRIMARY KEY (flag_key, environment)
)

Flag Rule Table

flag_rule (
  rule_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  priority INT,
  condition JSON,
  variation JSON,
  created_at TIMESTAMP
)

Audit Log Table

flag_audit_log (
  audit_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  actor_id VARCHAR,
  action VARCHAR,
  old_value JSON,
  new_value JSON,
  reason TEXT,
  approval_status VARCHAR,
  created_at TIMESTAMP
)

Evaluation Event Table

flag_evaluation_event (
  event_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  user_id VARCHAR,
  tenant_id VARCHAR,
  variation JSON,
  evaluated_at TIMESTAMP
)

👉 Interview Answer

I would separate flag metadata, environment-specific config, targeting rules, audit logs, and evaluation events.

Flag metadata changes rarely, but environment config and rollout rules are what SDKs need for runtime evaluation.
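
As a rough illustration of how the rule table might be read back when building the SDK payload, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The load_rules helper and the sample rows are assumptions made for this example, and the condition column is quoted because CONDITION is a reserved word in some SQL dialects (MySQL, for example).

```python
import json
import sqlite3

def create_schema(conn: sqlite3.Connection) -> None:
    # Mirrors the flag_rule table above; JSON columns are stored as TEXT in SQLite.
    conn.execute(
        """
        CREATE TABLE flag_rule (
            rule_id TEXT PRIMARY KEY,
            flag_key TEXT,
            environment TEXT,
            priority INTEGER,
            "condition" TEXT,
            variation TEXT,
            created_at TEXT
        )
        """
    )

def load_rules(conn: sqlite3.Connection, flag_key: str, environment: str) -> list:
    # Hypothetical helper: rules for one flag and environment, lowest priority number first.
    rows = conn.execute(
        'SELECT rule_id, priority, "condition", variation FROM flag_rule '
        "WHERE flag_key = ? AND environment = ? ORDER BY priority ASC",
        (flag_key, environment),
    ).fetchall()
    return [
        {
            "rule_id": row[0],
            "priority": row[1],
            "condition": json.loads(row[2]),
            "variation": json.loads(row[3]),
        }
        for row in rows
    ]

conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.executemany(
    "INSERT INTO flag_rule VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("r2", "new_checkout_enabled", "production", 20, json.dumps({"plan": "enterprise"}), "true", "2024-01-01"),
        ("r1", "new_checkout_enabled", "production", 10, json.dumps({"tenantId": "t-blocked"}), "false", "2024-01-01"),
        ("r3", "new_checkout_enabled", "staging", 10, json.dumps({}), "true", "2024-01-01"),
    ],
)
rules = load_rules(conn, "new_checkout_enabled", "production")
```

Sorting in the query keeps the evaluation order explicit in one place, so the SDK does not have to guess how rules should be ordered.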

6️⃣ High-Level Architecture

Admin UI
→ Feature Flag Management API
→ Flag Config Store
→ Audit Log

Flag Config Publisher
→ CDN / Edge Cache / Streaming Channel
→ SDK Cache

Application
→ Feature Flag SDK
→ Local Evaluation Engine
→ Feature Enabled / Disabled

Main Components

Management Control Plane

Create flags
Update rules
Manage approvals
Record audit logs

Config Distribution Layer

Publishes flag changes
Supports cache / streaming updates
Serves SDK config

SDK

Caches flag config
Evaluates locally
Reports evaluation events
Falls back safely

Evaluation Engine

Applies targeting rules
Handles percentage rollout
Returns final variation

👉 Interview Answer

I would separate control plane from data plane.

The control plane manages flag configuration.

The data plane is the SDK and evaluation engine inside application services.

This avoids adding network latency to every request.

7️⃣ Flag Evaluation Flow

Evaluation Context

Example:

{
  "userId": "u123",
  "tenantId": "t456",
  "region": "US",
  "plan": "enterprise",
  "device": "ios",
  "appVersion": "2.5.0"
}

Evaluation Steps

Load flag config
→ Check environment enabled
→ Apply kill switch
→ Evaluate targeting rules in priority order
→ Apply percentage rollout
→ Return matching variation
→ Fall back to default if no rule matches

Example Rule

{
  "if": {
    "plan": "enterprise",
    "region": "US"
  },
  "then": true
}

👉 Interview Answer

Flag evaluation should be deterministic.

The engine takes flag config and request context, checks targeting rules, applies rollout logic, and returns the final variation.

If evaluation fails, the SDK should return a safe default.
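
The evaluation steps above can be sketched as a single function. This is a minimal sketch for a boolean flag, not a definitive engine: the config field names (enabled, killSwitch, rolloutPercentage, defaultValue, rules) and the exact-match rule semantics are assumptions made for illustration.

```python
import hashlib
from typing import Any

def _bucket(flag_key: str, user_id: str) -> int:
    # Stable 0-99 bucket derived from the flag key and user ID.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def _matches(condition: dict, context: dict) -> bool:
    # A rule matches when every attribute in its "if" block equals the context value.
    return all(context.get(key) == value for key, value in condition.items())

def evaluate(config: dict, context: dict, safe_default: Any = False) -> Any:
    try:
        # Environment toggle and kill switch short-circuit everything else.
        if not config["enabled"] or config.get("killSwitch", False):
            return config["defaultValue"]
        # First matching rule wins, lowest priority number first.
        for rule in sorted(config.get("rules", []), key=lambda r: r["priority"]):
            if _matches(rule["if"], context):
                return rule["then"]
        # Percentage rollout applies to everyone not matched by a rule.
        if _bucket(config["key"], context["userId"]) < config.get("rolloutPercentage", 0):
            return True
        # Nothing matched.
        return config["defaultValue"]
    except Exception:
        # Malformed config or context: never raise into the request path.
        return safe_default

CONFIG = {
    "key": "new_checkout_enabled",
    "enabled": True,
    "killSwitch": False,
    "defaultValue": False,
    "rolloutPercentage": 0,
    "rules": [{"priority": 10, "if": {"plan": "enterprise", "region": "US"}, "then": True}],
}
CONTEXT = {"userId": "u123", "tenantId": "t456", "region": "US", "plan": "enterprise"}
```

Because the function depends only on its two inputs, the same config and context always produce the same result, which is what makes the evaluation deterministic and easy to test.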

8️⃣ Percentage Rollout

Goal

Gradually release a feature to users.

Example:

1% → 5% → 25% → 50% → 100%

Stable Bucketing

Use deterministic hashing with a hash function that is stable across processes and languages:

bucket = hash(flag_key + ":" + user_id) % 100

If bucket < rollout percentage:

enabled

Why Stable?

The same user should consistently get the same flag value.

👉 Interview Answer

For percentage rollout, I would use deterministic hashing based on flag key and user ID.

This ensures stable bucketing, so the same user consistently sees the same experience while rollout percentage changes gradually.
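
A minimal sketch of stable bucketing, assuming SHA-256 as the hash function (the notes do not prescribe one) and hypothetical helper names bucket and in_rollout. Python's built-in hash() is not suitable here because it is randomized per process for strings.

```python
import hashlib

def bucket(flag_key: str, user_id: str, buckets: int = 100) -> int:
    # SHA-256 gives the same result on every process and machine.
    # The ":" separator keeps ("ab", "c") and ("a", "bc") from hashing the same input.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def in_rollout(flag_key: str, user_id: str, rollout_percentage: int) -> bool:
    # A user is enabled when their bucket falls below the current percentage.
    return bucket(flag_key, user_id) < rollout_percentage

FLAG = "new_checkout_enabled"
USERS = [f"u{i}" for i in range(200)]
enabled_at_10 = [u for u in USERS if in_rollout(FLAG, u, 10)]
enabled_at_25 = [u for u in USERS if in_rollout(FLAG, u, 25)]
```

Since a user's bucket never changes, raising the percentage only adds users: everyone enabled at 10% is still enabled at 25%. Including the flag key in the hash input means different flags bucket the same user independently.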

9️⃣ Targeting Rules

Common Targeting Dimensions

userId
tenantId
plan
region
country
device
app version
browser
user segment
account age
beta user group

Rule Priority

Rules should be evaluated in deterministic order.

Example:

Disable for blocked tenants
Enable for beta users
Enable 10% rollout
Default false

👉 Interview Answer

Targeting rules should be evaluated in priority order.

This allows explicit allowlists or blocklists to override percentage rollout.

Rule ordering must be clear and auditable.
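
The example ordering above (blocklist, then beta users, then 10% rollout, then default) can be sketched as a first-match-wins rule list. The Rule class, the sample blocklist and allowlist, and the precomputed bucket field in the context are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass(frozen=True)
class Rule:
    priority: int                      # lower number is evaluated first
    name: str
    matches: Callable[[dict], bool]
    value: bool

BLOCKED_TENANTS = {"t-blocked"}        # hypothetical blocklist
BETA_USERS = {"u123"}                  # hypothetical allowlist

# Deliberately listed out of order to show that priority, not list position, decides.
RULES = [
    Rule(30, "10% rollout", lambda ctx: ctx["bucket"] < 10, True),
    Rule(10, "blocked tenants", lambda ctx: ctx["tenantId"] in BLOCKED_TENANTS, False),
    Rule(20, "beta users", lambda ctx: ctx["userId"] in BETA_USERS, True),
]

def evaluate(ctx: dict, default: bool = False) -> Tuple[bool, Optional[str]]:
    # First matching rule wins; returning the rule name keeps the decision auditable.
    for rule in sorted(RULES, key=lambda r: r.priority):
        if rule.matches(ctx):
            return rule.value, rule.name
    return default, None
```

A beta user in a blocked tenant stays disabled because the blocklist rule has the lowest priority number, which is the override behavior described above.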

🔟 SDK and Caching

Why SDK?

Applications need fast flag evaluation.

Calling a remote service on every request causes:

Higher latency
Higher failure risk
More load on flag service

SDK Responsibilities

Download flag config
Cache config locally
Evaluate flags locally
Refresh periodically
Receive streaming updates if supported
Report evaluation events asynchronously
Return safe defaults on failure

Cache Strategy

Startup fetch
→ Local memory cache
→ Background refresh
→ Last known good config fallback

👉 Interview Answer

SDKs should evaluate flags locally using cached configuration.

They should periodically refresh config, support last-known-good fallback, and never block critical user requests on the flag service.
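
A minimal sketch of the last-known-good fallback, assuming a hypothetical FlagCache class and a fetch callable that downloads the config. A real SDK would call refresh() from a background thread or timer; here it is called directly to keep the example short.

```python
from typing import Any, Callable, Optional

class FlagCache:
    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch                    # hypothetical callable that downloads the config
        self._config: Optional[dict] = None    # last known good config

    def refresh(self) -> bool:
        # Replace the cached config only when the fetch succeeds.
        try:
            self._config = self._fetch()
            return True
        except Exception:
            return False                       # keep serving the last known good config

    def get(self, flag_key: str, safe_default: Any) -> Any:
        # Reads never touch the network, so requests are not blocked on the flag service.
        if self._config is None:
            return safe_default
        return self._config.get(flag_key, safe_default)

calls = {"count": 0}

def flaky_fetch() -> dict:
    # Simulated flag service: succeeds once, then becomes unavailable.
    calls["count"] += 1
    if calls["count"] > 1:
        raise ConnectionError("flag service unavailable")
    return {"new_checkout_enabled": True}

cache = FlagCache(flaky_fetch)
before_first_fetch = cache.get("new_checkout_enabled", False)
first_refresh = cache.refresh()
second_refresh = cache.refresh()
after_outage = cache.get("new_checkout_enabled", False)
```

Before the first successful fetch the cache can only return the safe default; after that, a failed refresh leaves the previous config in place.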

1️⃣1️⃣ Consistency and Propagation

Propagation Options

Polling

SDK fetches config every N seconds.

Pros:

Simple
Robust

Cons:

Slower propagation

Streaming

Server pushes flag changes to SDKs.

Pros:

Fast updates

Cons:

More operational complexity

CDN / Edge Config

SDK downloads config from CDN.

Pros:

Highly available
Low latency

Cons:

Cache invalidation delay

👉 Interview Answer

Flag propagation can use polling, streaming, or CDN-backed config distribution.

For most flags, propagation within seconds is acceptable.

For kill switches, I would use faster propagation and shorter cache TTLs.
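
When polling and streaming are combined, the same change can arrive twice or out of order. One way to handle this, sketched below, is to apply an update only if its version is newer than the one already held, using the version column from flag_environment_config as a monotonically increasing counter. The FlagStore class and its method names are assumptions for illustration.

```python
from typing import Any, Dict, Optional, Tuple

class FlagStore:
    def __init__(self) -> None:
        # flag_key -> (version, config)
        self._flags: Dict[str, Tuple[int, Any]] = {}

    def apply(self, flag_key: str, version: int, config: Any) -> bool:
        # Ignore updates that are not strictly newer than what is already applied.
        current = self._flags.get(flag_key)
        if current is not None and version <= current[0]:
            return False
        self._flags[flag_key] = (version, config)
        return True

    def get(self, flag_key: str) -> Optional[Any]:
        entry = self._flags.get(flag_key)
        return None if entry is None else entry[1]

store = FlagStore()
applied_v2 = store.apply("new_checkout_enabled", 2, {"enabled": False})   # kill switch via stream
applied_v1 = store.apply("new_checkout_enabled", 1, {"enabled": True})    # stale poll response
```

With this check, a late poll response carrying an older version cannot re-enable a feature that a streamed kill switch update already disabled.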

1️⃣2️⃣ Kill Switch

Purpose

Quickly disable risky behavior.

Examples:

Disable new checkout
Disable external dependency call
Disable expensive background job
Disable risky recommendation model

Requirements

Fast propagation
Safe default
High availability
Clear ownership
Audit log
Emergency permission path

👉 Interview Answer

Kill switches are safety controls.

They should be designed for fast propagation, high availability, and safe fallback.

The system should allow authorized operators to quickly disable risky features during incidents.

1️⃣3️⃣ A/B Testing and Experiments

Feature Flag vs Experiment

Feature flag:

turn feature on/off

Experiment:

assign users to variants and measure outcome

Experiment Variants

{
  "control": "old_ui",
  "variant_a": "new_ui_v1",
  "variant_b": "new_ui_v2"
}

Important Rule

Experiment assignment must be stable.

Use:

hash(experiment_key + ":" + user_id)

👉 Interview Answer

Feature flags can support experimentation, but experiments require stable assignment, exposure logging, and metrics analysis.

The system must record which variant the user saw, so outcomes can be attributed correctly.
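
A minimal sketch of stable variant assignment with exposure logging. The variant weights, the SHA-256 choice, and the in-memory exposures list (a stand-in for an asynchronous event pipeline) are assumptions for illustration.

```python
import hashlib
from typing import List, Tuple

def assign_variant(experiment_key: str, user_id: str, weights: List[Tuple[str, int]]) -> str:
    # weights: (variant_name, percent) pairs that must sum to 100.
    if sum(weight for _, weight in weights) != 100:
        raise ValueError("variant weights must sum to 100")
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    upper = 0
    for name, weight in weights:
        upper += weight
        if bucket < upper:
            return name
    raise AssertionError("unreachable when weights sum to 100")

exposures: List[dict] = []   # stand-in for an asynchronous event pipeline

def get_variant_and_log(experiment_key: str, user_id: str, weights: List[Tuple[str, int]]) -> str:
    # Record which variant the user actually saw so outcomes can be attributed later.
    variant = assign_variant(experiment_key, user_id, weights)
    exposures.append({"experiment": experiment_key, "userId": user_id, "variant": variant})
    return variant

WEIGHTS = [("control", 34), ("variant_a", 33), ("variant_b", 33)]
first = get_variant_and_log("checkout_ui_test", "u123", WEIGHTS)
second = get_variant_and_log("checkout_ui_test", "u123", WEIGHTS)
```

Each variant owns a contiguous range of buckets, so a user's assignment stays fixed for as long as the weights do not change.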

1️⃣4️⃣ Audit and Governance

Why Needed?

Flag changes can impact production behavior instantly.

Audit Fields

Who changed the flag
What changed
Old value
New value
Environment
Reason
Timestamp
Approval status

Governance Controls

Require approval for production changes
Limit who can modify kill switches
Flag ownership
Expiration date
Stale flag cleanup
Change review

👉 Interview Answer

Feature flags are production controls, so every change must be audited.

For production environments, I would support approval workflows, ownership, change reason, rollback history, and stale flag cleanup.
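
A minimal sketch of a change path that enforces a reason and a production approval, and writes an audit entry with the fields listed above. The change_flag function, the in-memory store and audit_log, and the ApprovalRequired exception are assumptions for illustration; a real system would write the config change and the audit entry in one transaction.

```python
import uuid
from datetime import datetime, timezone

class ApprovalRequired(Exception):
    """Raised when a production change has no approver."""

def change_flag(store, audit_log, *, flag_key, environment, new_value,
                actor_id, reason, approved_by=None):
    if not reason.strip():
        raise ValueError("a change reason is required")
    if environment == "production" and approved_by is None:
        raise ApprovalRequired(f"{flag_key}: production changes need an approver")
    old_value = store.get((flag_key, environment))
    store[(flag_key, environment)] = new_value
    entry = {
        "audit_id": str(uuid.uuid4()),
        "flag_key": flag_key,
        "environment": environment,
        "actor_id": actor_id,
        "action": "update",
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
        "approved_by": approved_by,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry

store, audit_log = {}, []
change_flag(store, audit_log, flag_key="new_checkout_enabled", environment="staging",
            new_value=True, actor_id="alice", reason="validate in staging")
```

Keeping old_value in the entry is what makes rollback history possible: reverting is just another audited change back to the recorded value.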

1️⃣5️⃣ Stale Flag Cleanup

Problem

Old flags accumulate.

Risks:

Code complexity
Confusing behavior
Security risk
Performance overhead
Incorrect assumptions

Strategy

Track:

owner
created_at
last_evaluated_at
expiration_date
status

Cleanup Flow

Detect stale flag
→ Notify owner
→ Create cleanup ticket
→ Remove flag from code
→ Delete flag config

👉 Interview Answer

Stale flags are technical debt.

I would require each flag to have an owner, purpose, and expiration date.

The system should detect unused flags and notify owners to remove them from code and configuration.
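
A minimal sketch of stale flag detection based on the tracked fields above. The 90-day unused threshold and the find_stale_flags helper are assumptions for illustration; the right threshold depends on the team's release cadence.

```python
from datetime import datetime, timedelta, timezone

def find_stale_flags(flags, now, unused_after=timedelta(days=90)):
    # Returns (flag_key, owner, reason) for flags that are expired or no longer evaluated.
    stale = []
    for flag in flags:
        expiration = flag.get("expiration_date")
        expired = expiration is not None and expiration <= now
        # A flag that was never evaluated is measured from its creation time.
        last_seen = flag.get("last_evaluated_at") or flag["created_at"]
        unused = now - last_seen >= unused_after
        if expired or unused:
            stale.append((flag["flag_key"], flag["owner"], "expired" if expired else "unused"))
    return stale

UTC = timezone.utc
NOW = datetime(2024, 6, 1, tzinfo=UTC)
FLAGS = [
    {"flag_key": "old_banner", "owner": "web-team", "created_at": datetime(2023, 1, 1, tzinfo=UTC),
     "last_evaluated_at": datetime(2023, 12, 1, tzinfo=UTC), "expiration_date": None},
    {"flag_key": "new_checkout_enabled", "owner": "checkout-team", "created_at": datetime(2024, 5, 1, tzinfo=UTC),
     "last_evaluated_at": datetime(2024, 5, 31, tzinfo=UTC), "expiration_date": datetime(2024, 9, 1, tzinfo=UTC)},
    {"flag_key": "promo_2024", "owner": "growth-team", "created_at": datetime(2024, 3, 1, tzinfo=UTC),
     "last_evaluated_at": datetime(2024, 5, 30, tzinfo=UTC), "expiration_date": datetime(2024, 5, 15, tzinfo=UTC)},
]
stale = find_stale_flags(FLAGS, NOW)
```

The output includes the owner so the next step in the cleanup flow (notify owner, create a ticket) has what it needs.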

1️⃣6️⃣ Security

Risks

Unauthorized production flag change
Cross-tenant flag leakage
Wrong environment update
Client-side exposure of sensitive rules
Flag misused as a permission system
Secret values stored in flags

Controls

RBAC / ABAC for flag management
Environment-level permissions
Tenant-scoped flags
Audit logs
Approval workflow
Do not store secrets in flags
Server-side evaluation for sensitive flags

👉 Interview Answer

Feature flags are not a replacement for authorization.

Sensitive decisions should be evaluated server-side, and secrets should never be stored in flag values.

Production flag changes should require proper permissions and audit logs.

1️⃣7️⃣ Scaling Patterns

Pattern 1: Local Evaluation

Avoid remote calls on every request.

Pattern 2: Config Distribution via CDN

Highly available global config delivery.

Pattern 3: Streaming for Critical Updates

Fast propagation for kill switches.

Pattern 4: Event-driven Audit and Analytics

Flag changes and evaluations emit events.

Pattern 5: Tenant-aware Flag Partitioning

Large enterprise tenants can have dedicated configs.

👉 Interview Answer

To scale feature flags, I would rely on local SDK evaluation, distribute configs through cache or CDN, use streaming for critical updates, and process evaluation events asynchronously.
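
A minimal sketch of asynchronous event reporting, assuming a hypothetical EventBuffer that the request path writes to and a background worker drains in batches. The key design choice shown is that a full buffer drops events rather than blocking the request.

```python
import queue
from typing import List

class EventBuffer:
    def __init__(self, max_size: int = 1000) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_size)
        self.dropped = 0

    def record(self, event: dict) -> bool:
        # Called on the request path: never blocks, drops the event when the buffer is full.
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int = 100) -> List[dict]:
        # Called by a background worker, which would send the batch to the analytics pipeline.
        batch: List[dict] = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

buffer = EventBuffer(max_size=2)
results = [buffer.record({"flag_key": "new_checkout_enabled", "userId": f"u{i}"}) for i in range(3)]
batch = buffer.drain()
```

Evaluation events are analytics data, so losing a few under load is usually an acceptable trade for keeping flag evaluation off the critical path; the dropped counter makes that loss observable.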

1️⃣8️⃣ Failure Handling

Common Failures

Flag service unavailable
SDK cannot refresh config
Bad flag rule deployed
Config propagation delayed
Evaluation error
Wrong targeting rule
Kill switch not propagated fast enough

Strategies

Last-known-good config
Safe default values
Config validation before publish
Gradual rollout
Rollback version
Emergency kill switch
Alert on evaluation error rate
Audit and approval workflow

👉 Interview Answer

Applications should not fail because the flag service is down.

SDKs should use last-known-good config and safe defaults.

Bad config should be prevented through validation, and rollback should be fast and auditable.
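
A minimal sketch of config validation before publish. The specific checks (rollout percentage range, unique integer priorities, required rule fields) and the validate_config and publish helpers are assumptions for illustration; a real system would likely validate against a fuller schema.

```python
from typing import Callable, List

def validate_config(config: dict) -> List[str]:
    errors: List[str] = []
    pct = config.get("rollout_percentage")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(pct, bool) or not isinstance(pct, int) or not 0 <= pct <= 100:
        errors.append("rollout_percentage must be an integer between 0 and 100")
    seen = set()
    for index, rule in enumerate(config.get("rules", [])):
        priority = rule.get("priority")
        if isinstance(priority, bool) or not isinstance(priority, int):
            errors.append(f"rule {index}: priority must be an integer")
        elif priority in seen:
            errors.append(f"rule {index}: duplicate priority {priority}")
        else:
            seen.add(priority)
        if not isinstance(rule.get("condition"), dict):
            errors.append(f"rule {index}: condition must be an object")
        if "variation" not in rule:
            errors.append(f"rule {index}: variation is required")
    return errors

def publish(config: dict, publisher: Callable[[dict], None]) -> None:
    # Refuse to distribute a config that fails validation.
    errors = validate_config(config)
    if errors:
        raise ValueError("; ".join(errors))
    publisher(config)

GOOD = {"rollout_percentage": 10, "rules": [{"priority": 10, "condition": {"plan": "enterprise"}, "variation": True}]}
BAD = {"rollout_percentage": 150, "rules": [
    {"priority": 10, "condition": {"plan": "enterprise"}, "variation": True},
    {"priority": 10, "condition": "plan == enterprise"},
]}
published: List[dict] = []
publish(GOOD, published.append)
```

Rejecting duplicate priorities at publish time keeps rule ordering unambiguous, so SDKs never have to break ties on their own.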

1️⃣9️⃣ Consistency Model

Stronger Consistency Needed For

Flag management changes
Audit logs
Approval workflow
Kill switch updates
Security-sensitive flags

Eventual Consistency Acceptable For

Normal rollout propagation
Evaluation event analytics
Dashboard metrics
Stale flag detection
Experiment reporting

👉 Interview Answer

Feature flag systems use mixed consistency.

Flag changes and audit logs need stronger correctness.

Runtime propagation can often be eventually consistent, but kill switches and security-sensitive flags need faster and safer propagation.

2️⃣0️⃣ Observability

Key Metrics

Flag evaluation latency
SDK config refresh success rate
Config propagation delay
Evaluation error rate
Flag service availability
Number of stale flags
Rollout percentage by flag
Kill switch activation count
Experiment exposure count
Flag change audit volume

👉 Interview Answer

I would monitor evaluation latency, SDK refresh success, propagation delay, evaluation errors, stale flags, kill switch usage, and experiment exposure counts.

These metrics show whether the flag system is safe and reliable.

2️⃣1️⃣ End-to-End Flow

Runtime Evaluation Flow

Application starts
→ SDK downloads flag config
→ Request arrives
→ App builds evaluation context
→ SDK evaluates flag locally
→ App uses enabled/disabled behavior
→ SDK emits evaluation event asynchronously

Rollout Flow

Engineer creates flag
→ Adds targeting rule
→ Publishes to staging
→ Validates behavior
→ Requests production approval
→ Rolls out 1%, 5%, 25%, 50%, 100%
→ Monitors metrics

Kill Switch Flow

Incident detected
→ Operator disables flag
→ Config update published
→ SDKs refresh or receive push
→ Feature disabled
→ Audit event recorded

Key Insight

A feature flag system is not just a runtime if/else; it is a platform for safe configuration, rollout, experimentation, and incident control.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a feature flag system, I think of it as a runtime control plane for product and engineering behavior.

The system allows teams to enable or disable features, target specific users or tenants, gradually roll out changes, run experiments, and quickly disable risky behavior during incidents.

I would separate the control plane from the data plane.

The control plane includes the admin UI, management APIs, configuration store, approval workflow, and audit logs.

The data plane is the SDK and local evaluation engine running inside applications.

Runtime evaluation should usually happen locally using cached flag configuration, because calling a remote flag service on every request adds latency and creates availability risk.

The evaluation engine takes request context, such as user ID, tenant ID, region, plan, app version, and device, then evaluates targeting rules, percentage rollout, and defaults to return a final variation.

Percentage rollout should use deterministic hashing so users are assigned to stable buckets.

For experiments, the system must provide stable assignment and log exposure events so metrics can be analyzed correctly.

Safety is critical. Production flag changes should be audited, high-risk changes may require approval, and kill switches should propagate quickly.

SDKs should use last-known-good config and safe defaults if the flag service is unavailable.

The main trade-offs are consistency, latency, safety, operational complexity, and governance.

Ultimately, the goal is to let teams ship safely, roll out gradually, experiment reliably, and recover quickly from production issues.

⭐ Final Insight

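The core of a feature flag system is not a simple if/else; it is a runtime control plane that supports safe releases, gradual rollout, experiment analysis, fast rollback, and production governance.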
