🎯 Design a Feature Flag System
1️⃣ Core Framework
When discussing feature flag system design, I frame it around:
- Feature flag data model
- Targeting and rollout rules
- Flag evaluation engine
- SDK and caching strategy
- Control plane and data plane separation
- Audit, approval, and governance
- Kill switch and incident rollback
- Trade-offs: consistency vs latency vs safety
2️⃣ Core Requirements
Functional Requirements
- Create, update, delete feature flags
- Enable / disable features by environment
- Target by user, tenant, region, plan, device, app version
- Support percentage rollout
- Support A/B experiments
- Support kill switch
- Support audit logs
- Support approval workflow
- Support SDK evaluation
Non-functional Requirements
- Low-latency flag evaluation
- High availability
- Safe rollout
- Strong auditability
- Prevent accidental global rollout
- Support real-time or near-real-time propagation
- SDK should work even if flag service is unavailable
👉 Interview Answer
A feature flag system controls feature behavior at runtime without redeploying code.
The main challenge is evaluating flags safely and with low latency while supporting targeting and gradual rollout, without giving up auditability, consistency, or fast rollback.
3️⃣ Core Concepts
Feature Flag
A flag controls whether a feature is enabled.
Example:
new_checkout_enabled = true
Environment
Flags usually differ by environment:
dev
qa
staging
production
Targeting Rule
Example:
enable for enterprise tenants in US region
Percentage Rollout
Example:
enable for 10% of users
Kill Switch
A flag used to immediately disable risky behavior.
👉 Interview Answer
I would model feature flags as runtime configuration.
Each flag has environments, targeting rules, rollout percentage, default value, owner, and audit history.
The evaluation engine decides the final value for each request context.
4️⃣ Main APIs
Create Flag
POST /api/flags
Request:
{
  "key": "new_checkout_enabled",
  "description": "Enable new checkout experience",
  "owner": "checkout-team",
  "defaultValue": false
}
Update Flag Rules
PATCH /api/flags/{flagKey}/rules
Evaluate Flag
POST /api/flags/evaluate
Request:
{
  "flagKey": "new_checkout_enabled",
  "context": {
    "userId": "u123",
    "tenantId": "t456",
    "region": "US",
    "plan": "enterprise",
    "appVersion": "2.5.0"
  }
}
Get All Flags for SDK
GET /api/sdk/flags?environment=production
👉 Interview Answer
I would expose management APIs for creating and updating flags, and evaluation APIs for runtime usage.
In production, most evaluation should happen locally inside SDKs using cached flag rules, not by calling the flag service on every request.
5️⃣ Data Model
Feature Flag Table
CREATE TABLE feature_flag (
    flag_key      VARCHAR PRIMARY KEY,
    name          VARCHAR,
    description   TEXT,
    owner_team    VARCHAR,
    default_value JSON,
    flag_type     VARCHAR,
    status        VARCHAR,
    created_at    TIMESTAMP,
    updated_at    TIMESTAMP
);
Flag Environment Config Table
CREATE TABLE flag_environment_config (
    flag_key           VARCHAR,
    environment        VARCHAR,
    enabled            BOOLEAN,
    rules              JSON,
    rollout_percentage INT,
    version            BIGINT,
    updated_at         TIMESTAMP,
    PRIMARY KEY (flag_key, environment)
);
Flag Rule Table
CREATE TABLE flag_rule (
    rule_id     VARCHAR PRIMARY KEY,
    flag_key    VARCHAR,
    environment VARCHAR,
    priority    INT,
    condition   JSON,
    variation   JSON,
    created_at  TIMESTAMP
);
Audit Log Table
CREATE TABLE flag_audit_log (
    audit_id    VARCHAR PRIMARY KEY,
    flag_key    VARCHAR,
    environment VARCHAR,  -- which environment the change applied to
    actor_id    VARCHAR,
    action      VARCHAR,
    old_value   JSON,
    new_value   JSON,
    reason      TEXT,
    created_at  TIMESTAMP
);
Evaluation Event Table
CREATE TABLE flag_evaluation_event (
    event_id     VARCHAR PRIMARY KEY,
    flag_key     VARCHAR,
    environment  VARCHAR,
    user_id      VARCHAR,
    tenant_id    VARCHAR,
    variation    JSON,
    evaluated_at TIMESTAMP
);
👉 Interview Answer
I would separate flag metadata, environment-specific config, targeting rules, audit logs, and evaluation events.
Flag metadata changes rarely, but environment config and rollout rules are what SDKs need for runtime evaluation.
6️⃣ High-Level Architecture
Admin UI
→ Feature Flag Management API
→ Flag Config Store
→ Audit Log
Flag Config Publisher
→ CDN / Edge Cache / Streaming Channel
→ SDK Cache
Application
→ Feature Flag SDK
→ Local Evaluation Engine
→ Feature Enabled / Disabled
Main Components
Management Control Plane
- Create flags
- Update rules
- Manage approvals
- Record audit logs
Config Distribution Layer
- Publishes flag changes
- Supports cache / streaming updates
- Serves SDK config
SDK
- Caches flag config
- Evaluates locally
- Reports evaluation events
- Falls back safely
Evaluation Engine
- Applies targeting rules
- Handles percentage rollout
- Returns final variation
👉 Interview Answer
I would separate control plane from data plane.
The control plane manages flag configuration.
The data plane is the SDK and evaluation engine inside application services.
This avoids adding network latency to every request.
7️⃣ Flag Evaluation Flow
Evaluation Context
Example:
{
  "userId": "u123",
  "tenantId": "t456",
  "region": "US",
  "plan": "enterprise",
  "device": "ios",
  "appVersion": "2.5.0"
}
Evaluation Steps
Load flag config
→ Check environment enabled
→ Apply kill switch
→ Evaluate targeting rules in priority order
→ Apply percentage rollout
→ Return matching variation
→ Fall back to default if no rule matches
Example Rule
{
  "if": {
    "plan": "enterprise",
    "region": "US"
  },
  "then": true
}
👉 Interview Answer
Flag evaluation should be deterministic.
The engine takes flag config and request context, checks targeting rules, applies rollout logic, and returns the final variation.
If evaluation fails, the SDK should return a safe default.
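The evaluation steps above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the config field names (`enabled`, `killed`, `default`, `rollout_value`) and the exact-match rule format are assumptions made for illustration, and the bucketing helper is one possible stable hash.

```python
import hashlib
from typing import Any


def _bucket(flag_key: str, unit_id: str) -> int:
    """Stable bucket in [0, 100) derived from flag key and user ID."""
    digest = hashlib.sha256(f"{flag_key}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def evaluate(flag: dict[str, Any], context: dict[str, Any]) -> Any:
    """Return the variation for this context, or the safe default on any error."""
    default = flag.get("default", False)
    try:
        # Environment disabled or kill switch engaged: stop immediately.
        if not flag.get("enabled", False) or flag.get("killed", False):
            return default
        # Targeting rules in ascending priority order; first match wins.
        for rule in sorted(flag.get("rules", []), key=lambda r: r["priority"]):
            if all(context.get(k) == v for k, v in rule["if"].items()):
                return rule["then"]
        # Percentage rollout for users that no explicit rule matched.
        percentage = flag.get("rollout_percentage", 0)
        if "userId" in context and _bucket(flag["key"], context["userId"]) < percentage:
            return flag.get("rollout_value", True)
        return default
    except Exception:
        # A malformed config must never break the calling request.
        return default
```

With the example rule above and a context of `plan = enterprise`, `region = US`, the sketch returns `True`; changing the region makes it fall through to the default.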
8️⃣ Percentage Rollout
Goal
Gradually release a feature to users.
Example:
1% → 5% → 25% → 50% → 100%
Stable Bucketing
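Use deterministic hashing with a function that is stable across processes and SDK languages (not a per-process salted hash):
hash(flag_key + ":" + user_id) % 100
If result < rollout percentage:
enabled
Why Stable?
The same user should consistently get the same flag value.
Because each user's bucket is fixed, raising the percentage only adds users; nobody who already has the feature loses it. Including the flag key in the hash keeps buckets independent across flags, so the same users are not always first in every rollout.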
👉 Interview Answer
For percentage rollout, I would use deterministic hashing based on flag key and user ID.
This ensures stable bucketing, so the same user consistently sees the same experience while rollout percentage changes gradually.
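A minimal sketch of stable bucketing, assuming SHA-256 as the hash and a `:` separator between flag key and user ID (both are illustrative choices, not something the design mandates). Python's built-in `hash()` is deliberately avoided because string hashes are salted per process.

```python
import hashlib


def bucket(flag_key: str, user_id: str, buckets: int = 100) -> int:
    """Map (flag_key, user_id) to a stable bucket in [0, buckets)."""
    # The separator reduces accidental collisions such as ("ab", "c") vs ("a", "bc").
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % buckets


def in_rollout(flag_key: str, user_id: str, percentage: int) -> bool:
    """True if the user's bucket falls below the rollout percentage."""
    return bucket(flag_key, user_id) < percentage
```

Because a user's bucket never changes, the set of users enabled at 10% is always a subset of the set enabled at 25%.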
9️⃣ Targeting Rules
Common Targeting Dimensions
- userId
- tenantId
- plan
- region
- country
- device
- app version
- browser
- user segment
- account age
- beta user group
Rule Priority
Rules should be evaluated in deterministic order.
Example:
1. Disable for blocked tenants
2. Enable for beta users
3. Enable 10% rollout
4. Default false
👉 Interview Answer
Targeting rules should be evaluated in priority order.
This allows explicit allowlists or blocklists to override percentage rollout.
Rule ordering must be clear and auditable.
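To illustrate first-match-wins ordering, here is a small sketch in which a blocklist rule outranks a beta allowlist. The tenant and user sets, the tuple layout, and the `first_match` helper are hypothetical; the percentage rollout step is left out to keep the example focused on ordering.

```python
from typing import Any, Callable

# (priority, rule name, predicate over the context, variation to return)
Rule = tuple[int, str, Callable[[dict[str, Any]], bool], bool]

BLOCKED_TENANTS = {"t999"}  # hypothetical blocklist
BETA_USERS = {"u123"}       # hypothetical beta group

RULES: list[Rule] = [
    (2, "beta users", lambda c: c.get("userId") in BETA_USERS, True),
    (1, "blocked tenants", lambda c: c.get("tenantId") in BLOCKED_TENANTS, False),
]


def first_match(rules: list[Rule], context: dict[str, Any], default: bool = False) -> tuple[bool, str]:
    """Return (variation, matched rule name); lower priority number wins."""
    for _priority, name, predicate, variation in sorted(rules, key=lambda r: r[0]):
        if predicate(context):
            return variation, name
    return default, "default"
```

Returning the matched rule name alongside the variation makes the decision explainable, which helps when auditing why a user saw a given experience.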
🔟 SDK and Caching
Why SDK?
Applications need fast flag evaluation.
Calling a remote service on every request causes:
- Higher latency
- Higher failure risk
- More load on flag service
SDK Responsibilities
- Download flag config
- Cache config locally
- Evaluate flags locally
- Refresh periodically
- Receive streaming updates if supported
- Report evaluation events asynchronously
- Return safe defaults on failure
Cache Strategy
Startup fetch
→ Local memory cache
→ Background refresh
→ Last known good config fallback
👉 Interview Answer
SDKs should evaluate flags locally using cached configuration.
They should periodically refresh config, support last-known-good fallback, and never block critical user requests on the flag service.
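The cache strategy can be sketched as a small in-memory cache with last-known-good fallback. The `FlagCache` class and its injected `fetch` callable are assumptions for illustration; a real SDK would also persist the last good config and add jitter and backoff to refreshes.

```python
import threading
from typing import Any, Callable


class FlagCache:
    """In-memory flag cache that keeps the last good config on refresh failure."""

    def __init__(self, fetch: Callable[[], dict[str, Any]], defaults: dict[str, Any]):
        self._fetch = fetch          # hypothetical callable that downloads config
        self._defaults = defaults    # hard-coded safe defaults shipped with the app
        self._config: dict[str, Any] = {}
        self._lock = threading.Lock()

    def refresh(self) -> bool:
        """Try to replace the cached config; on failure keep what we have."""
        try:
            fresh = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._config = fresh
        return True

    def get(self, key: str) -> Any:
        """Read from cache only; never performs network I/O on the request path."""
        with self._lock:
            if key in self._config:
                return self._config[key]
        return self._defaults.get(key)

    def start_background_refresh(self, interval_seconds: float) -> threading.Event:
        """Refresh periodically on a daemon thread; set the returned event to stop."""
        stop = threading.Event()

        def loop() -> None:
            # Event.wait returns False on timeout, so this runs once per interval.
            while not stop.wait(interval_seconds):
                self.refresh()

        threading.Thread(target=loop, daemon=True).start()
        return stop
```

The key property is that `get` only touches local memory, so a flag service outage degrades freshness rather than availability.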
1️⃣1️⃣ Consistency and Propagation
Propagation Options
Polling
SDK fetches config every N seconds.
Pros:
- Simple
- Robust
Cons:
- Slower propagation
Streaming
Server pushes flag changes to SDKs.
Pros:
- Fast updates
Cons:
- More operational complexity
CDN / Edge Config
SDK downloads config from CDN.
Pros:
- Highly available
- Low latency
Cons:
- Cache invalidation delay
👉 Interview Answer
Flag propagation can use polling, streaming, or CDN-backed config distribution.
For most flags, propagation within seconds is acceptable.
For kill switches, I would use faster propagation and shorter cache TTLs.
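Whichever channel delivers the config, updates can arrive late or out of order (for example a stale CDN response after a newer streamed update). One way to guard against that, sketched here under the assumption that every snapshot carries a monotonically increasing `version` like the `version` column in the data model, is to apply a snapshot only when it is newer than the one held. The `VersionedConfig` class is hypothetical.

```python
from typing import Any


class VersionedConfig:
    """Holds the newest config snapshot seen so far."""

    def __init__(self) -> None:
        self.version = -1
        self.flags: dict[str, Any] = {}

    def apply(self, snapshot: dict[str, Any]) -> bool:
        """Apply a snapshot only if its version is newer; return whether it was applied."""
        if snapshot["version"] <= self.version:
            return False  # stale or duplicate delivery, ignore it
        self.version = snapshot["version"]
        self.flags = snapshot["flags"]
        return True
```

This keeps a slow poll from reverting a kill switch that a faster channel has already delivered.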
1️⃣2️⃣ Kill Switch
Purpose
Quickly disable risky behavior.
Examples:
- Disable new checkout
- Disable external dependency call
- Disable expensive background job
- Disable risky recommendation model
Requirements
- Fast propagation
- Safe default
- High availability
- Clear ownership
- Audit log
- Emergency permission path
👉 Interview Answer
Kill switches are safety controls.
They should be designed for fast propagation, high availability, and safe fallback.
The system should allow authorized operators to quickly disable risky features during incidents.
1️⃣3️⃣ A/B Testing and Experiments
Feature Flag vs Experiment
Feature flag:
turn feature on/off
Experiment:
assign users to variants and measure outcome
Experiment Variants
{
  "control": "old_ui",
  "variant_a": "new_ui_v1",
  "variant_b": "new_ui_v2"
}
Important Rule
Experiment assignment must be stable.
Use:
hash(experiment_key + ":" + user_id)
👉 Interview Answer
Feature flags can support experimentation, but experiments require stable assignment, exposure logging, and metrics analysis.
The system must record which variant the user saw, so outcomes can be attributed correctly.
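A minimal sketch of stable variant assignment plus exposure logging. The integer weights, the `assign_variant` and `record_exposure` helpers, and the in-memory list standing in for an event pipeline are all assumptions for illustration.

```python
import hashlib


def assign_variant(experiment_key: str, user_id: str, weights: dict[str, int]) -> str:
    """Deterministically pick a variant in proportion to its integer weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    # Sort by name so assignment does not depend on dict insertion order.
    for variant, weight in sorted(weights.items()):
        cumulative += weight
        if point < cumulative:
            return variant
    raise AssertionError("unreachable: point is always below the total weight")


def record_exposure(log: list, experiment_key: str, user_id: str, variant: str) -> None:
    """Append an exposure event; a real system would send this asynchronously."""
    log.append({"experiment_key": experiment_key, "user_id": user_id, "variant": variant})
```

Exposure should be logged when the user actually sees the variant, not merely when assignment is computed, so that analysis counts only users who were affected.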
1️⃣4️⃣ Audit and Governance
Why Needed?
Flag changes can impact production behavior instantly.
Audit Fields
- Who changed the flag
- What changed
- Old value
- New value
- Environment
- Reason
- Timestamp
- Approval status
Governance Controls
- Require approval for production changes
- Limit who can modify kill switches
- Flag ownership
- Expiration date
- Stale flag cleanup
- Change review
👉 Interview Answer
Feature flags are production controls, so every change must be audited.
For production environments, I would support approval workflows, ownership, change reason, rollback history, and stale flag cleanup.
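As a sketch of how a change path could enforce these controls, the function below refuses production changes without an approver and always writes an audit entry with old and new values. The `AuditEntry` fields mirror the audit fields listed above; the `change_flag` helper, the dict-backed store, and the rule that only production needs approval are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class AuditEntry:
    flag_key: str
    environment: str
    actor_id: str
    action: str
    old_value: Any
    new_value: Any
    reason: str
    approved_by: Optional[str]
    audit_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def change_flag(store: dict, audit_log: list, *, flag_key: str, environment: str,
                new_value: Any, actor_id: str, reason: str,
                approved_by: Optional[str] = None) -> None:
    """Apply a flag change and record who changed what, where, and why."""
    if not reason.strip():
        raise ValueError("a change reason is required")
    if environment == "production" and approved_by is None:
        raise PermissionError("production changes require approval")
    old_value = store.get((flag_key, environment))
    store[(flag_key, environment)] = new_value
    audit_log.append(AuditEntry(flag_key, environment, actor_id, "update",
                                old_value, new_value, reason, approved_by))
```

In a real system the config write and the audit write would share one transaction so that neither can succeed without the other.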
1️⃣5️⃣ Stale Flag Cleanup
Problem
Old flags accumulate.
Risks:
- Code complexity
- Confusing behavior
- Security risk
- Performance overhead
- Incorrect assumptions
Strategy
Track:
owner
created_at
last_evaluated_at
expiration_date
status
Cleanup Flow
Detect stale flag
→ Notify owner
→ Create cleanup ticket
→ Remove flag from code
→ Delete flag config
👉 Interview Answer
Stale flags are technical debt.
I would require each flag to have an owner, purpose, and expiration date.
The system should detect unused flags and notify owners to remove them from code and configuration.
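Detection can be as simple as a periodic scan over the tracked fields. This sketch assumes a 30-day "unused" threshold and a `FlagUsage` record, both hypothetical; it treats a flag as stale if it is past its expiration date or has not been evaluated recently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FlagUsage:
    flag_key: str
    owner: str
    created_at: datetime
    last_evaluated_at: Optional[datetime]
    expiration_date: Optional[datetime]


def find_stale(flags: list[FlagUsage], now: datetime,
               unused_after: timedelta = timedelta(days=30)) -> list[FlagUsage]:
    """Return flags that are expired or have not been evaluated recently."""
    stale = []
    for flag in flags:
        expired = flag.expiration_date is not None and flag.expiration_date <= now
        # A flag that was never evaluated is aged from its creation time,
        # so brand-new flags are not reported immediately.
        last_seen = flag.last_evaluated_at or flag.created_at
        unused = now - last_seen > unused_after
        if expired or unused:
            stale.append(flag)
    return stale
```

The result feeds the cleanup flow: notify each `owner` and open a ticket per flag.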
1️⃣6️⃣ Security
Risks
- Unauthorized production flag change
- Cross-tenant flag leakage
- Wrong environment update
- Client-side exposure of sensitive rules
- Flags misused as a permission system
- Secret values stored in flags
Controls
- RBAC / ABAC for flag management
- Environment-level permissions
- Tenant-scoped flags
- Audit logs
- Approval workflow
- Do not store secrets in flags
- Server-side evaluation for sensitive flags
👉 Interview Answer
Feature flags are not a replacement for authorization.
Sensitive decisions should be evaluated server-side, and secrets should never be stored in flag values.
Production flag changes should require proper permissions and audit logs.
1️⃣7️⃣ Scaling Patterns
Pattern 1: Local Evaluation
Avoid remote calls on every request.
Pattern 2: Config Distribution via CDN
Highly available global config delivery.
Pattern 3: Streaming for Critical Updates
Fast propagation for kill switches.
Pattern 4: Event-driven Audit and Analytics
Flag changes and evaluations emit events.
Pattern 5: Tenant-aware Flag Partitioning
Large enterprise tenants can have dedicated configs.
👉 Interview Answer
To scale feature flags, I would rely on local SDK evaluation, distribute configs through cache or CDN, use streaming for critical updates, and process evaluation events asynchronously.
1️⃣8️⃣ Failure Handling
Common Failures
- Flag service unavailable
- SDK cannot refresh config
- Bad flag rule deployed
- Config propagation delayed
- Evaluation error
- Wrong targeting rule
- Kill switch not propagated fast enough
Strategies
- Last-known-good config
- Safe default values
- Config validation before publish
- Gradual rollout
- Rollback version
- Emergency kill switch
- Alert on evaluation error rate
- Audit and approval workflow
👉 Interview Answer
Applications should not fail because the flag service is down.
SDKs should use last-known-good config and safe defaults.
Bad config should be prevented through validation, and rollback should be fast and auditable.
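Validation before publish can catch the most damaging mistakes cheaply. The checks below are a sketch under assumed field names (`rollout_percentage`, `rules`, `priority`, `if`, `then`); a real validator would be driven by the actual config schema.

```python
from typing import Any


def validate_config(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means publishable."""
    errors: list[str] = []

    pct = config.get("rollout_percentage")
    if not isinstance(pct, int) or isinstance(pct, bool) or not 0 <= pct <= 100:
        errors.append("rollout_percentage must be an integer in [0, 100]")

    priorities = []
    for i, rule in enumerate(config.get("rules", [])):
        if not isinstance(rule, dict):
            errors.append(f"rule {i}: must be an object")
            continue
        condition = rule.get("if")
        # An empty condition matches every user, which is an accidental global rollout.
        if not isinstance(condition, dict) or not condition:
            errors.append(f"rule {i}: 'if' must be a non-empty object")
        if "then" not in rule:
            errors.append(f"rule {i}: missing 'then' variation")
        if isinstance(rule.get("priority"), int):
            priorities.append(rule["priority"])
        else:
            errors.append(f"rule {i}: 'priority' must be an integer")

    if len(priorities) != len(set(priorities)):
        errors.append("rule priorities must be unique")
    return errors
```

The publisher would refuse to distribute any config for which this returns a non-empty list.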
1️⃣9️⃣ Consistency Model
Stronger Consistency Needed For
- Flag management changes
- Audit logs
- Approval workflow
- Kill switch updates
- Security-sensitive flags
Eventual Consistency Acceptable For
- Normal rollout propagation
- Evaluation event analytics
- Dashboard metrics
- Stale flag detection
- Experiment reporting
👉 Interview Answer
Feature flag systems use mixed consistency.
Flag changes and audit logs need stronger consistency guarantees.
Runtime propagation can often be eventually consistent, but kill switches and security-sensitive flags need faster and safer propagation.
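For the management side, one common way to get the stronger guarantee is optimistic concurrency on the config `version`: a write succeeds only if the writer saw the latest version. The in-memory `ConfigStore` below is a hypothetical stand-in for a database row with a version column, not a prescribed implementation.

```python
import threading
from typing import Any, Optional


class VersionConflict(Exception):
    """Raised when a writer's expected version is no longer current."""


class ConfigStore:
    def __init__(self) -> None:
        # (flag_key, environment) -> (version, config)
        self._rows: dict[tuple[str, str], tuple[int, Any]] = {}
        self._lock = threading.Lock()

    def read(self, flag_key: str, environment: str) -> tuple[int, Optional[Any]]:
        with self._lock:
            return self._rows.get((flag_key, environment), (0, None))

    def update(self, flag_key: str, environment: str,
               expected_version: int, config: Any) -> int:
        """Compare-and-set: write only if nobody else has written since the read."""
        with self._lock:
            current_version, _ = self._rows.get((flag_key, environment), (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"expected version {expected_version}, found {current_version}")
            new_version = current_version + 1
            self._rows[(flag_key, environment)] = (new_version, config)
            return new_version
```

Two operators editing the same flag concurrently cannot silently overwrite each other: the second write fails and must be retried against the latest config.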
2️⃣0️⃣ Observability
Key Metrics
- Flag evaluation latency
- SDK config refresh success rate
- Config propagation delay
- Evaluation error rate
- Flag service availability
- Number of stale flags
- Rollout percentage by flag
- Kill switch activation count
- Experiment exposure count
- Flag change audit volume
👉 Interview Answer
I would monitor evaluation latency, SDK refresh success, propagation delay, evaluation errors, stale flags, kill switch usage, and experiment exposure counts.
These metrics show whether the flag system is safe and reliable.
2️⃣1️⃣ End-to-End Flow
Runtime Evaluation Flow
Application starts
→ SDK downloads flag config
→ Request arrives
→ App builds evaluation context
→ SDK evaluates flag locally
→ App uses enabled/disabled behavior
→ SDK emits evaluation event asynchronously
Rollout Flow
Engineer creates flag
→ Adds targeting rule
→ Publishes to staging
→ Validates behavior
→ Requests production approval
→ Rolls out 1%, 5%, 25%, 50%, 100%
→ Monitors metrics
Kill Switch Flow
Incident detected
→ Operator disables flag
→ Config update published
→ SDKs refresh or receive push
→ Feature disabled
→ Audit event recorded
Key Insight
A feature flag system is not just a runtime if/else; it is a platform for safe configuration, rollout, experimentation, and incident control.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a feature flag system, I think of it as a runtime control plane for product and engineering behavior.
The system allows teams to enable or disable features, target specific users or tenants, gradually roll out changes, run experiments, and quickly disable risky behavior during incidents.
I would separate the control plane from the data plane.
The control plane includes the admin UI, management APIs, configuration store, approval workflow, and audit logs.
The data plane is the SDK and local evaluation engine running inside applications.
Runtime evaluation should usually happen locally using cached flag configuration, because calling a remote flag service on every request adds latency and creates availability risk.
The evaluation engine takes request context, such as user ID, tenant ID, region, plan, app version, and device, then evaluates targeting rules, percentage rollout, and defaults to return a final variation.
Percentage rollout should use deterministic hashing so users are assigned to stable buckets.
For experiments, the system must provide stable assignment and log exposure events so metrics can be analyzed correctly.
Safety is critical. Production flag changes should be audited, high-risk changes may require approval, and kill switches should propagate quickly.
SDKs should use last-known-good config and safe defaults if the flag service is unavailable.
The main trade-offs are consistency, latency, safety, operational complexity, and governance.
Ultimately, the goal is to let teams ship safely, roll out gradually, experiment reliably, and recover quickly from production issues.
⭐ Final Insight
The core of a feature flag system is not a simple if/else; it is a runtime control plane that supports safe releases, gradual rollout, experiment analysis, fast rollback, and production governance.