System Design Deep Dive - 29 Design Feature Flag System

Posted by ailswan on May 22, 2026

中文 ↓

🎯 Design Feature Flag System


1️⃣ Core Framework

When discussing Feature Flag System design, I frame it as:

  1. Feature flag data model
  2. Targeting and rollout rules
  3. Flag evaluation engine
  4. SDK and caching strategy
  5. Control plane and data plane separation
  6. Audit, approval, and governance
  7. Kill switch and incident rollback
  8. Trade-offs: consistency vs latency vs safety

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

A feature flag system controls feature behavior at runtime without redeploying code.

The main challenge is evaluating flags safely and with low latency while supporting targeting and gradual rollout, and still maintaining auditability, consistency, and fast rollback.


3️⃣ Core Concepts


Feature Flag

A flag controls whether a feature is enabled.

Example:

new_checkout_enabled = true

Environment

Flags usually differ by environment:

dev
qa
staging
production

Targeting Rule

Example:

enable for enterprise tenants in US region

Percentage Rollout

Example:

enable for 10% of users

Kill Switch

A flag used to immediately disable risky behavior.


👉 Interview Answer

I would model feature flags as runtime configuration.

Each flag has environments, targeting rules, rollout percentage, default value, owner, and audit history.

The evaluation engine decides the final value for each request context.


4️⃣ Main APIs


Create Flag

POST /api/flags

Request:

{
  "key": "new_checkout_enabled",
  "description": "Enable new checkout experience",
  "owner": "checkout-team",
  "defaultValue": false
}

Update Flag Rules

PATCH /api/flags/{flagKey}/rules

Evaluate Flag

POST /api/flags/evaluate

Request:

{
  "flagKey": "new_checkout_enabled",
  "context": {
    "userId": "u123",
    "tenantId": "t456",
    "region": "US",
    "plan": "enterprise",
    "appVersion": "2.5.0"
  }
}

Get All Flags for SDK

GET /api/sdk/flags?environment=production

👉 Interview Answer

I would expose management APIs for creating and updating flags, and evaluation APIs for runtime usage.

In production, most evaluation should happen locally inside SDKs using cached flag rules, not by calling the flag service on every request.


5️⃣ Data Model


Feature Flag Table

feature_flag (
  flag_key VARCHAR PRIMARY KEY,
  name VARCHAR,
  description TEXT,
  owner_team VARCHAR,
  default_value JSON,
  flag_type VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

Flag Environment Config Table

flag_environment_config (
  flag_key VARCHAR,
  environment VARCHAR,
  enabled BOOLEAN,
  rules JSON,
  rollout_percentage INT,
  version BIGINT,
  updated_at TIMESTAMP,
  PRIMARY KEY (flag_key, environment)
)

Flag Rule Table

flag_rule (
  rule_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  priority INT,
  condition JSON,
  variation JSON,
  created_at TIMESTAMP
)

Audit Log Table

flag_audit_log (
  audit_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  actor_id VARCHAR,
  action VARCHAR,
  old_value JSON,
  new_value JSON,
  reason TEXT,
  created_at TIMESTAMP
)

Evaluation Event Table

flag_evaluation_event (
  event_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  user_id VARCHAR,
  tenant_id VARCHAR,
  variation JSON,
  evaluated_at TIMESTAMP
)

👉 Interview Answer

I would separate flag metadata, environment-specific config, targeting rules, audit logs, and evaluation events.

Flag metadata changes rarely, but environment config and rollout rules are what SDKs need for runtime evaluation.
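
To make the split between metadata and runtime config concrete, here is a minimal sketch of how a publisher might turn environment config rows and rule rows into the payload an SDK downloads. The class names, the `build_sdk_payload` helper, and the payload field names are assumptions for illustration, not part of the schema above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FlagRule:
    """One row of flag_rule, reduced to what evaluation needs."""
    rule_id: str
    priority: int
    condition: dict[str, Any]
    variation: Any


@dataclass(frozen=True)
class FlagEnvironmentConfig:
    """One row of flag_environment_config plus the flag's default value."""
    flag_key: str
    environment: str
    enabled: bool
    rollout_percentage: int
    version: int
    default_value: Any
    rules: tuple[FlagRule, ...] = ()


def build_sdk_payload(configs: list[FlagEnvironmentConfig], environment: str) -> dict[str, Any]:
    """Return {flag_key: config} for one environment, rules sorted by priority."""
    payload: dict[str, Any] = {}
    for cfg in configs:
        if cfg.environment != environment:
            continue  # SDKs only need the environment they run in
        payload[cfg.flag_key] = {
            "enabled": cfg.enabled,
            "defaultValue": cfg.default_value,
            "rolloutPercentage": cfg.rollout_percentage,
            "version": cfg.version,
            "rules": [
                {"if": rule.condition, "then": rule.variation}
                for rule in sorted(cfg.rules, key=lambda r: r.priority)
            ],
        }
    return payload
```

Descriptive fields such as name, description, and owner stay in the control plane; only the fields needed for evaluation travel to the SDK.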


6️⃣ High-Level Architecture


Admin UI
→ Feature Flag Management API
→ Flag Config Store
→ Audit Log

Flag Config Publisher
→ CDN / Edge Cache / Streaming Channel
→ SDK Cache

Application
→ Feature Flag SDK
→ Local Evaluation Engine
→ Feature Enabled / Disabled

Main Components

Management Control Plane


Config Distribution Layer


SDK


Evaluation Engine


👉 Interview Answer

I would separate control plane from data plane.

The control plane manages flag configuration.

The data plane is the SDK and evaluation engine inside application services.

This avoids adding network latency to every request.


7️⃣ Flag Evaluation Flow


Evaluation Context

Example:

{
  "userId": "u123",
  "tenantId": "t456",
  "region": "US",
  "plan": "enterprise",
  "device": "ios",
  "appVersion": "2.5.0"
}

Evaluation Steps

Load flag config
→ Check environment enabled
→ Apply kill switch
→ Evaluate targeting rules in priority order
→ Apply percentage rollout
→ Return matching variation
→ Fall back to default if no rule matches

Example Rule

{
  "if": {
    "plan": "enterprise",
    "region": "US"
  },
  "then": true
}

👉 Interview Answer

Flag evaluation should be deterministic.

The engine takes flag config and request context, checks targeting rules, applies rollout logic, and returns the final variation.

If evaluation fails, the SDK should return a safe default.
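
The evaluation steps can be sketched as a small pure function. This is a minimal sketch, assuming the config shape `{"enabled", "defaultValue", "rules"}` and exact-match conditions like the example rule. Percentage rollout is left out to keep it short, and treating a disabled flag as "return the default value" is an assumption.

```python
from typing import Any


def matches(condition: dict[str, Any], context: dict[str, Any]) -> bool:
    """A rule matches when every key in its "if" block equals the context value."""
    return all(context.get(key) == expected for key, expected in condition.items())


def evaluate(flag: Any, context: dict[str, Any], safe_default: Any = False) -> Any:
    """Return the variation for this context, or safe_default if evaluation fails."""
    try:
        if not flag["enabled"]:
            # Environment disabled or kill switch applied: skip all rules.
            return flag["defaultValue"]
        for rule in flag["rules"]:  # assumed to be pre-sorted by priority
            if matches(rule["if"], context):
                return rule["then"]
        return flag["defaultValue"]  # no rule matched
    except (KeyError, TypeError, AttributeError):
        # Malformed or missing config must not break the calling request.
        return safe_default
```

Because the function depends only on its two inputs, the same config and context always produce the same result, which is what makes evaluation deterministic and easy to test.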


8️⃣ Percentage Rollout


Goal

Gradually release a feature to users.

Example:

1% → 5% → 25% → 50% → 100%

Stable Bucketing

Use deterministic hashing:

hash(flag_key + user_id) % 100

If result < rollout percentage:

enabled

Why Stable?

The same user should consistently get the same flag value.


👉 Interview Answer

For percentage rollout, I would use deterministic hashing based on flag key and user ID.

This ensures stable bucketing, so the same user consistently sees the same experience while rollout percentage changes gradually.
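
A minimal sketch of stable bucketing follows. The text above does not name a hash function, so SHA-256 and the `:` separator between flag key and user ID are assumptions. The one firm requirement is that the hash is identical across processes and machines, which rules out Python's built-in `hash()` for strings because it is randomized per process.

```python
import hashlib


def bucket(flag_key: str, user_id: str, buckets: int = 100) -> int:
    """Map (flag_key, user_id) to a stable bucket in [0, buckets)."""
    # The separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_rollout(flag_key: str, user_id: str, rollout_percentage: int) -> bool:
    """True when the user's bucket falls below the rollout percentage."""
    return bucket(flag_key, user_id) < rollout_percentage
```

Because a user's bucket never changes, raising the percentage from 10 to 25 only adds users: everyone enabled at 10% stays enabled at 25%. Including the flag key in the hash gives each flag an independent set of buckets, so the same users are not always first in every rollout.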


9️⃣ Targeting Rules


Common Targeting Dimensions

user ID
tenant ID
region
plan
device
app version


Rule Priority

Rules should be evaluated in deterministic order.

Example:

1. Disable for blocked tenants
2. Enable for beta users
3. Enable 10% rollout
4. Default false

👉 Interview Answer

Targeting rules should be evaluated in priority order.

This allows explicit allowlists or blocklists to override percentage rollout.

Rule ordering must be clear and auditable.
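
The four-step priority example can be sketched as an ordered rule list. This is an illustrative sketch: the tenant and user IDs, the rule functions, and the use of `zlib.crc32` as a stand-in stable hash are all assumptions. Each rule returns `True`, `False`, or `None` for "no opinion", and the first rule with an opinion wins.

```python
import zlib
from typing import Any, Callable, Optional

Context = dict[str, Any]
Rule = tuple[int, str, Callable[[Context], Optional[bool]]]

BLOCKED_TENANTS = {"t_blocked"}  # hypothetical blocklist
BETA_USERS = {"u_beta"}          # hypothetical allowlist


def blocked_tenant(ctx: Context) -> Optional[bool]:
    return False if ctx.get("tenantId") in BLOCKED_TENANTS else None


def beta_user(ctx: Context) -> Optional[bool]:
    return True if ctx.get("userId") in BETA_USERS else None


def ten_percent_rollout(ctx: Context) -> Optional[bool]:
    key = f"new_checkout_enabled:{ctx.get('userId', '')}".encode("utf-8")
    return True if zlib.crc32(key) % 100 < 10 else None


# Deliberately listed out of order: evaluation sorts by the priority number.
RULES: list[Rule] = [
    (3, "10% rollout", ten_percent_rollout),
    (1, "blocked tenants", blocked_tenant),
    (2, "beta users", beta_user),
]


def evaluate(ctx: Context, rules: list[Rule] = RULES, default: bool = False) -> tuple[bool, str]:
    """Return (value, name of the rule that decided) so the outcome is explainable."""
    for _priority, name, rule in sorted(rules, key=lambda r: r[0]):
        decision = rule(ctx)
        if decision is not None:
            return decision, name
    return default, "default"
```

Returning the name of the deciding rule alongside the value is one way to keep ordering auditable: a beta user inside a blocked tenant gets `(False, "blocked tenants")`, which shows that the blocklist took precedence.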


🔟 SDK and Caching


Why SDK?

Applications need fast flag evaluation.

Calling a remote service on every request causes:

added latency
availability risk


SDK Responsibilities


Cache Strategy

Startup fetch
→ Local memory cache
→ Background refresh
→ Last known good config fallback

👉 Interview Answer

SDKs should evaluate flags locally using cached configuration.

They should periodically refresh config, support last-known-good fallback, and never block critical user requests on the flag service.
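
The cache strategy can be sketched as a small in-memory cache that keeps the last known good config when a refresh fails. The `FlagCache` class and its `fetch` callable are hypothetical; a real SDK would fetch over HTTP and would likely persist the config to disk as well.

```python
import threading
from typing import Any, Callable, Optional


class FlagCache:
    """In-memory flag cache with last-known-good fallback."""

    def __init__(self, fetch: Callable[[], dict[str, Any]]) -> None:
        self._fetch = fetch  # hypothetical callable that downloads the config
        self._lock = threading.Lock()
        self._config: Optional[dict[str, Any]] = None

    def refresh(self) -> bool:
        """Try to fetch new config; on any failure keep the current one."""
        try:
            new_config = self._fetch()
        except Exception:
            return False  # flag service unavailable: keep last known good
        if not isinstance(new_config, dict):
            return False  # reject an obviously malformed payload
        with self._lock:
            self._config = new_config
        return True

    def get(self, flag_key: str, default: Any) -> Any:
        """Read from local memory only; never touches the network."""
        with self._lock:
            config = self._config
        if config is None:
            return default  # nothing fetched yet: use the safe default
        return config.get(flag_key, default)

    def start_background_refresh(self, interval_seconds: float) -> threading.Event:
        """Refresh on a daemon thread until the returned event is set."""
        stop = threading.Event()

        def loop() -> None:
            while not stop.wait(interval_seconds):
                self.refresh()

        threading.Thread(target=loop, daemon=True).start()
        return stop
```

`get` only reads local memory, so a slow or failing flag service can delay updates but cannot block a user request.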


1️⃣1️⃣ Consistency and Propagation


Propagation Options

Polling

SDK fetches config every N seconds.

Pros:

Cons:


Streaming

Server pushes flag changes to SDKs.

Pros:

Cons:


CDN / Edge Config

SDK downloads config from CDN.

Pros:

Cons:


👉 Interview Answer

Flag propagation can use polling, streaming, or CDN-backed config distribution.

For most flags, propagation within seconds is acceptable.

For kill switches, I would use faster propagation and shorter cache TTLs.
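
Two small pieces of propagation logic are sketched below: choosing a shorter refresh interval for kill switches, and applying an update only when its version is newer, so a late poll response cannot overwrite a newer pushed update. The interval values and the `kill_switch` flag type name are assumptions for illustration only.

```python
from typing import Any, Optional

# Hypothetical refresh intervals in seconds; tune these for your own system.
REFRESH_INTERVAL_SECONDS = {"kill_switch": 5, "default": 30}


def refresh_interval(flag_type: str) -> int:
    """Kill switches refresh more often than ordinary flags."""
    return REFRESH_INTERVAL_SECONDS.get(flag_type, REFRESH_INTERVAL_SECONDS["default"])


def apply_update(current: Optional[dict[str, Any]], incoming: dict[str, Any]) -> dict[str, Any]:
    """Keep whichever config has the higher version number."""
    if current is None or incoming["version"] > current["version"]:
        return incoming
    return current  # stale or duplicate update: ignore it
```

The version check relies on the control plane incrementing `version` on every change, as the `version` column in the environment config table suggests.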


1️⃣2️⃣ Kill Switch


Purpose

Quickly disable risky behavior.

Examples:


Requirements


👉 Interview Answer

Kill switches are safety controls.

They should be designed for fast propagation, high availability, and safe fallback.

The system should allow authorized operators to quickly disable risky features during incidents.


1️⃣3️⃣ A/B Testing and Experiments


Feature Flag vs Experiment

Feature flag:

turn feature on/off

Experiment:

assign users to variants and measure outcome

Experiment Variants

{
  "control": "old_ui",
  "variant_a": "new_ui_v1",
  "variant_b": "new_ui_v2"
}

Important Rule

Experiment assignment must be stable.

Use:

hash(experiment_key + user_id)

👉 Interview Answer

Feature flags can support experimentation, but experiments require stable assignment, exposure logging, and metrics analysis.

The system must record which variant the user saw, so outcomes can be attributed correctly.
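
Stable assignment plus exposure logging can be sketched as follows. The equal split across variants, the SHA-256 hash, and the in-memory `exposure_log` list are assumptions; a real system would send exposure events to an analytics pipeline.

```python
import hashlib
from typing import Any

VARIANTS = {
    "control": "old_ui",
    "variant_a": "new_ui_v1",
    "variant_b": "new_ui_v2",
}

exposure_log: list[dict[str, Any]] = []  # stand-in for an event pipeline


def assign_variant(experiment_key: str, user_id: str) -> str:
    """Deterministically pick a variant name, splitting users evenly."""
    names = sorted(VARIANTS)  # fixed order so the mapping is reproducible
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    return names[int.from_bytes(digest[:8], "big") % len(names)]


def get_experience(experiment_key: str, user_id: str) -> str:
    """Return the experience to show and record that the user was exposed."""
    variant = assign_variant(experiment_key, user_id)
    exposure_log.append(
        {"experimentKey": experiment_key, "userId": user_id, "variant": variant}
    )
    return VARIANTS[variant]
```

With this simple modulo scheme, adding or removing a variant reassigns many users, so the set of variants should stay fixed for the lifetime of an experiment.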


1️⃣4️⃣ Audit and Governance


Why Needed?

Flag changes can impact production behavior instantly.


Audit Fields


Governance Controls

approval workflows
ownership
change reason
rollback history
stale flag cleanup


👉 Interview Answer

Feature flags are production controls, so every change must be audited.

For production environments, I would support approval workflows, ownership, change reason, rollback history, and stale flag cleanup.
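
A minimal sketch of an audited change, assuming an in-memory `store` dict and `audit_log` list in place of real tables. The entry fields mirror the audit log table; the `change_flag` function and the `CREATE` / `UPDATE` action names are hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Any


def change_flag(
    store: dict[str, Any],
    audit_log: list[dict[str, Any]],
    flag_key: str,
    new_value: Any,
    actor_id: str,
    reason: str,
) -> dict[str, Any]:
    """Apply a flag change and record who changed what, and why."""
    if not reason.strip():
        raise ValueError("a change reason is required")
    entry = {
        "audit_id": str(uuid.uuid4()),
        "flag_key": flag_key,
        "actor_id": actor_id,
        "action": "UPDATE" if flag_key in store else "CREATE",
        "old_value": store.get(flag_key),
        "new_value": new_value,
        "reason": reason,
        "created_at": datetime.now(timezone.utc),
    }
    # In a real system the change and its audit entry would be written
    # in one transaction so neither can exist without the other.
    store[flag_key] = new_value
    audit_log.append(entry)
    return entry
```

Storing both `old_value` and `new_value` is what makes rollback straightforward: reverting is another audited change that writes the old value back.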


1️⃣5️⃣ Stale Flag Cleanup


Problem

Old flags accumulate.

Risks:


Strategy

Track:

owner
created_at
last_evaluated_at
expiration_date
status

Cleanup Flow

Detect stale flag
→ Notify owner
→ Create cleanup ticket
→ Remove flag from code
→ Delete flag config

👉 Interview Answer

Stale flags are technical debt.

I would require each flag to have an owner, purpose, and expiration date.

The system should detect unused flags and notify owners to remove them from code and configuration.
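
Stale flag detection can be sketched with the tracked fields above. The 90-day threshold, the dict-based flag records, and the `find_stale_flags` name are assumptions. A flag is reported when it is past its expiration date or has not been evaluated recently.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def find_stale_flags(
    flags: list[dict[str, Any]],
    now: datetime,
    unused_after: timedelta = timedelta(days=90),  # assumed threshold
) -> list[tuple[str, str, str]]:
    """Return (flag_key, owner, reason) for every flag that needs cleanup."""
    stale = []
    for flag in flags:
        expiration = flag.get("expiration_date")
        # A flag that was never evaluated is measured from its creation time.
        last_seen = flag.get("last_evaluated_at") or flag["created_at"]
        if expiration is not None and expiration <= now:
            stale.append((flag["flag_key"], flag["owner"], "expired"))
        elif now - last_seen >= unused_after:
            stale.append((flag["flag_key"], flag["owner"], "unused"))
    return stale
```

The returned owner is who gets notified and assigned the cleanup ticket. Deleting the config should come last, after the flag is removed from code, so running services never look up a missing flag.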


1️⃣6️⃣ Security


Risks


Controls


👉 Interview Answer

Feature flags are not a replacement for authorization.

Sensitive decisions should be evaluated server-side, and secrets should never be stored in flag values.

Production flag changes should require proper permissions and audit logs.


1️⃣7️⃣ Scaling Patterns


Pattern 1: Local Evaluation

Avoid remote calls on every request.


Pattern 2: Config Distribution via CDN

Highly available global config delivery.


Pattern 3: Streaming for Critical Updates

Fast propagation for kill switches.


Pattern 4: Event-driven Audit and Analytics

Flag changes and evaluations emit events.


Pattern 5: Tenant-aware Flag Partitioning

Large enterprise tenants can have dedicated configs.


👉 Interview Answer

To scale feature flags, I would rely on local SDK evaluation, distribute configs through cache or CDN, use streaming for critical updates, and process evaluation events asynchronously.


1️⃣8️⃣ Failure Handling


Common Failures


Strategies


👉 Interview Answer

Applications should not fail because the flag service is down.

SDKs should use last-known-good config and safe defaults.

Bad config should be prevented through validation, and rollback should be fast and auditable.
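
Validation before publish can be sketched as a function that returns a list of problems, so the control plane can reject a bad change before it reaches any SDK. The config field names (`enabled`, `defaultValue`, `rolloutPercentage`, `rules`) are assumptions about the published payload.

```python
from typing import Any


def validate_flag_config(config: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means publishable."""
    errors: list[str] = []

    if not isinstance(config.get("enabled"), bool):
        errors.append("enabled must be a boolean")

    if "defaultValue" not in config:
        errors.append("defaultValue is required")

    pct = config.get("rolloutPercentage")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(pct, int) or isinstance(pct, bool) or not 0 <= pct <= 100:
        errors.append("rolloutPercentage must be an integer from 0 to 100")

    rules = config.get("rules", [])
    if not isinstance(rules, list):
        errors.append("rules must be a list")
    else:
        for index, rule in enumerate(rules):
            if (
                not isinstance(rule, dict)
                or not isinstance(rule.get("if"), dict)
                or "then" not in rule
            ):
                errors.append(f"rule {index} must have an 'if' object and a 'then' value")

    return errors
```

Reporting every problem at once, instead of failing on the first one, gives the person making the change a complete picture in a single round trip.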


1️⃣9️⃣ Consistency Model


Stronger Consistency Needed For


Eventual Consistency Acceptable For


👉 Interview Answer

Feature flag systems use mixed consistency.

Flag changes and audit logs need stronger correctness.

Runtime propagation can often be eventually consistent, but kill switches and security-sensitive flags need faster and safer propagation.


2️⃣0️⃣ Observability


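Key Metrics

evaluation latency
SDK refresh success
propagation delay
evaluation errors
stale flags
kill switch usage
experiment exposure counts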


👉 Interview Answer

I would monitor evaluation latency, SDK refresh success, propagation delay, evaluation errors, stale flags, kill switch usage, and experiment exposure counts.

These metrics show whether the flag system is safe and reliable.


2️⃣1️⃣ End-to-End Flow


Runtime Evaluation Flow

Application starts
→ SDK downloads flag config
→ Request arrives
→ App builds evaluation context
→ SDK evaluates flag locally
→ App uses enabled/disabled behavior
→ SDK emits evaluation event asynchronously

Rollout Flow

Engineer creates flag
→ Adds targeting rule
→ Publishes to staging
→ Validates behavior
→ Requests production approval
→ Rolls out 1%, 5%, 25%, 50%, 100%
→ Monitors metrics

Kill Switch Flow

Incident detected
→ Operator disables flag
→ Config update published
→ SDKs refresh or receive push
→ Feature disabled
→ Audit event recorded

Key Insight

A feature flag system is not just a runtime if/else; it is a platform for safe configuration, rollout, experimentation, and incident control.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a feature flag system, I think of it as a runtime control plane for product and engineering behavior.

The system allows teams to enable or disable features, target specific users or tenants, gradually roll out changes, run experiments, and quickly disable risky behavior during incidents.

I would separate the control plane from the data plane.

The control plane includes the admin UI, management APIs, configuration store, approval workflow, and audit logs.

The data plane is the SDK and local evaluation engine running inside applications.

Runtime evaluation should usually happen locally using cached flag configuration, because calling a remote flag service on every request adds latency and creates availability risk.

The evaluation engine takes request context, such as user ID, tenant ID, region, plan, app version, and device, then evaluates targeting rules, percentage rollout, and defaults to return a final variation.

Percentage rollout should use deterministic hashing so users are assigned to stable buckets.

For experiments, the system must provide stable assignment and log exposure events so metrics can be analyzed correctly.

Safety is critical. Production flag changes should be audited, high-risk changes may require approval, and kill switches should propagate quickly.

SDKs should use last-known-good config and safe defaults if the flag service is unavailable.

The main trade-offs are consistency, latency, safety, operational complexity, and governance.

Ultimately, the goal is to let teams ship safely, roll out gradually, experiment reliably, and recover quickly from production issues.


⭐ Final Insight

The core of a feature flag system is not a simple if/else, but a runtime control plane that supports safe releases, gradual rollout, experiment analysis, fast rollback, and production governance.



中文部分


🎯 Design Feature Flag System


1️⃣ 核心框架

在设计 Feature Flag System 时,我通常从以下几个方面分析:

  1. Feature flag data model
  2. Targeting and rollout rules
  3. Flag evaluation engine
  4. SDK and caching strategy
  5. Control plane 和 data plane 分离
  6. Audit、approval 和 governance
  7. Kill switch 和 incident rollback
  8. 核心权衡:consistency vs latency vs safety

2️⃣ 核心需求


功能需求


非功能需求


👉 面试回答

Feature Flag System 用来在不重新部署代码的情况下, 动态控制功能行为。

核心挑战是低延迟、安全地评估 flags, 支持 targeting 和 gradual rollout, 同时保持 auditability、consistency 和快速 rollback 能力。


3️⃣ 核心概念


Feature Flag

Feature flag 控制某个功能是否启用。

示例:

new_checkout_enabled = true

Environment

Flags 通常按 environment 区分:

dev
qa
staging
production

Targeting Rule

示例:

enable for enterprise tenants in US region

Percentage Rollout

示例:

enable for 10% of users

Kill Switch

用于立即关闭高风险功能的 flag。


👉 面试回答

我会将 feature flags 建模成 runtime configuration。

每个 flag 包含 environments、targeting rules、 rollout percentage、default value、owner 和 audit history。

Evaluation engine 根据 request context 决定最终返回的 flag value。


4️⃣ Main APIs


Create Flag

POST /api/flags

Request:

{
  "key": "new_checkout_enabled",
  "description": "Enable new checkout experience",
  "owner": "checkout-team",
  "defaultValue": false
}

Update Flag Rules

PATCH /api/flags/{flagKey}/rules

Evaluate Flag

POST /api/flags/evaluate

Request:

{
  "flagKey": "new_checkout_enabled",
  "context": {
    "userId": "u123",
    "tenantId": "t456",
    "region": "US",
    "plan": "enterprise",
    "appVersion": "2.5.0"
  }
}

Get All Flags for SDK

GET /api/sdk/flags?environment=production

👉 面试回答

我会提供 management APIs 来创建和更新 flags, 也提供 evaluation APIs 给 runtime 使用。

在 production 中, 大多数 evaluation 应该在 SDK 内部用 cached flag rules 本地完成, 而不是每个请求都调用远程 flag service。


5️⃣ 数据模型


Feature Flag Table

feature_flag (
  flag_key VARCHAR PRIMARY KEY,
  name VARCHAR,
  description TEXT,
  owner_team VARCHAR,
  default_value JSON,
  flag_type VARCHAR,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

Flag Environment Config Table

flag_environment_config (
  flag_key VARCHAR,
  environment VARCHAR,
  enabled BOOLEAN,
  rules JSON,
  rollout_percentage INT,
  version BIGINT,
  updated_at TIMESTAMP,
  PRIMARY KEY (flag_key, environment)
)

Flag Rule Table

flag_rule (
  rule_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  priority INT,
  condition JSON,
  variation JSON,
  created_at TIMESTAMP
)

Audit Log Table

flag_audit_log (
  audit_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  actor_id VARCHAR,
  action VARCHAR,
  old_value JSON,
  new_value JSON,
  reason TEXT,
  created_at TIMESTAMP
)

Evaluation Event Table

flag_evaluation_event (
  event_id VARCHAR PRIMARY KEY,
  flag_key VARCHAR,
  environment VARCHAR,
  user_id VARCHAR,
  tenant_id VARCHAR,
  variation JSON,
  evaluated_at TIMESTAMP
)

👉 面试回答

我会将 flag metadata、environment-specific config、 targeting rules、audit logs 和 evaluation events 分开。

Flag metadata 很少变化; environment config 和 rollout rules 才是 SDK runtime evaluation 需要的数据。


6️⃣ High-Level Architecture


Admin UI
→ Feature Flag Management API
→ Flag Config Store
→ Audit Log

Flag Config Publisher
→ CDN / Edge Cache / Streaming Channel
→ SDK Cache

Application
→ Feature Flag SDK
→ Local Evaluation Engine
→ Feature Enabled / Disabled

Main Components

Management Control Plane


Config Distribution Layer


SDK


Evaluation Engine


👉 面试回答

我会将 control plane 和 data plane 分开。

Control plane 负责管理 flag configuration。

Data plane 是 application service 内部的 SDK 和 evaluation engine。

这样可以避免每个请求都增加远程网络延迟。


7️⃣ Flag Evaluation Flow


Evaluation Context

示例:

{
  "userId": "u123",
  "tenantId": "t456",
  "region": "US",
  "plan": "enterprise",
  "device": "ios",
  "appVersion": "2.5.0"
}

Evaluation Steps

Load flag config
→ Check environment enabled
→ Apply kill switch
→ Evaluate targeting rules in priority order
→ Apply percentage rollout
→ Return matching variation
→ Fall back to default if no rule matches

Example Rule

{
  "if": {
    "plan": "enterprise",
    "region": "US"
  },
  "then": true
}

👉 面试回答

Flag evaluation 应该是 deterministic 的。

Engine 接收 flag config 和 request context, 检查 targeting rules, 应用 rollout logic, 然后返回最终 variation。

如果 evaluation 失败, SDK 应该返回 safe default。


8️⃣ Percentage Rollout


Goal

逐步发布功能。

示例:

1% → 5% → 25% → 50% → 100%

Stable Bucketing

使用 deterministic hashing:

hash(flag_key + user_id) % 100

如果结果小于 rollout percentage:

enabled

Why Stable?

同一个用户应该稳定获得同一个 flag value。


👉 面试回答

对 percentage rollout, 我会使用基于 flag key 和 user ID 的 deterministic hashing。

这样可以保证 stable bucketing, 同一个用户会稳定看到同一种体验, 同时 rollout percentage 可以逐步扩大。


9️⃣ Targeting Rules


Common Targeting Dimensions


Rule Priority

Rules 应该按确定顺序执行。

示例:

1. Disable for blocked tenants
2. Enable for beta users
3. Enable 10% rollout
4. Default false

👉 面试回答

Targeting rules 应该按 priority order 执行。

这样 explicit allowlists 或 blocklists 可以覆盖 percentage rollout。

Rule ordering 必须清晰并可审计。


🔟 SDK and Caching


为什么需要 SDK?

Applications 需要快速 flag evaluation。

每个请求调用远程 service 会造成:


SDK Responsibilities


Cache Strategy

Startup fetch
→ Local memory cache
→ Background refresh
→ Last known good config fallback

👉 面试回答

SDK 应该使用 cached configuration 本地评估 flags。

它应该定期 refresh config, 支持 last-known-good fallback, 并且绝不应该让关键用户请求阻塞在 flag service 上。


1️⃣1️⃣ Consistency and Propagation


Propagation Options

Polling

SDK 每 N 秒拉取 config。

优点:

缺点:


Streaming

Server 推送 flag changes 到 SDKs。

优点:

缺点:


CDN / Edge Config

SDK 从 CDN 下载 config。

优点:

缺点:


👉 面试回答

Flag propagation 可以使用 polling、streaming 或 CDN-backed config distribution。

对大多数 flags, 几秒内 propagation 是可以接受的。

对 kill switches, 我会使用更快 propagation 和更短 cache TTL。


1️⃣2️⃣ Kill Switch


Purpose

快速关闭高风险行为。

示例:


Requirements


👉 面试回答

Kill switches 是安全控制。

它们应该支持 fast propagation、 high availability 和 safe fallback。

系统应该允许授权 operator 在 incident 期间快速关闭高风险功能。


1️⃣3️⃣ A/B Testing and Experiments


Feature Flag vs Experiment

Feature flag:

turn feature on/off

Experiment:

assign users to variants and measure outcome

Experiment Variants

{
  "control": "old_ui",
  "variant_a": "new_ui_v1",
  "variant_b": "new_ui_v2"
}

Important Rule

Experiment assignment 必须稳定。

使用:

hash(experiment_key + user_id)

👉 面试回答

Feature flags 可以支持 experimentation, 但 experiments 需要 stable assignment、 exposure logging 和 metrics analysis。

系统必须记录用户看到的 variant, 这样才能正确归因 outcomes。


1️⃣4️⃣ Audit and Governance


Why Needed?

Flag changes 可以立即影响 production behavior。


Audit Fields


Governance Controls


👉 面试回答

Feature flags 是 production controls, 所以每次 change 都必须 audit。

对 production environments, 我会支持 approval workflow、ownership、 change reason、rollback history 和 stale flag cleanup。


1️⃣5️⃣ Stale Flag Cleanup


Problem

旧 flags 会不断堆积。

风险:


Strategy

追踪:

owner
created_at
last_evaluated_at
expiration_date
status

Cleanup Flow

Detect stale flag
→ Notify owner
→ Create cleanup ticket
→ Remove flag from code
→ Delete flag config

👉 面试回答

Stale flags 是 technical debt。

我会要求每个 flag 都有 owner、purpose 和 expiration date。

系统应该检测 unused flags, 并通知 owners 从代码和配置中移除它们。


1️⃣6️⃣ Security


Risks


Controls


👉 面试回答

Feature flags 不是 authorization 的替代品。

Sensitive decisions 应该 server-side evaluate, secrets 永远不应该存储在 flag values 中。

Production flag changes 必须有正确权限 和 audit logs。


1️⃣7️⃣ Scaling Patterns


Pattern 1: Local Evaluation

避免每个 request 做 remote calls。


Pattern 2: Config Distribution via CDN

全球高可用分发 config。


Pattern 3: Streaming for Critical Updates

用于 kill switches 的快速 propagation。


Pattern 4: Event-driven Audit and Analytics

Flag changes 和 evaluations 产生 events。


Pattern 5: Tenant-aware Flag Partitioning

大型 enterprise tenants 可以有 dedicated configs。


👉 面试回答

为了扩展 feature flags, 我会依赖 SDK local evaluation, 通过 cache 或 CDN 分发 configs, 对 critical updates 使用 streaming, 并异步处理 evaluation events。


1️⃣8️⃣ Failure Handling


Common Failures


Strategies


👉 面试回答

Application 不应该因为 flag service down 而失败。

SDKs 应该使用 last-known-good config 和 safe defaults。

Bad config 应该通过 validation 预防, 并且 rollback 必须快速且可审计。


1️⃣9️⃣ Consistency Model


需要较强一致性的场景


可以最终一致的场景


👉 面试回答

Feature flag system 使用 mixed consistency。

Flag changes 和 audit logs 需要更强正确性。

Runtime propagation 通常可以最终一致, 但 kill switches 和 security-sensitive flags 需要更快且更安全的 propagation。


2️⃣0️⃣ Observability


Key Metrics


👉 面试回答

我会监控 evaluation latency、SDK refresh success、 propagation delay、evaluation errors、stale flags、 kill switch usage 和 experiment exposure counts。

这些指标可以说明 flag system 是否安全可靠。


2️⃣1️⃣ End-to-End Flow


Runtime Evaluation Flow

Application starts
→ SDK downloads flag config
→ Request arrives
→ App builds evaluation context
→ SDK evaluates flag locally
→ App uses enabled/disabled behavior
→ SDK emits evaluation event asynchronously

Rollout Flow

Engineer creates flag
→ Adds targeting rule
→ Publishes to staging
→ Validates behavior
→ Requests production approval
→ Rolls out 1%, 5%, 25%, 50%, 100%
→ Monitors metrics

Kill Switch Flow

Incident detected
→ Operator disables flag
→ Config update published
→ SDKs refresh or receive push
→ Feature disabled
→ Audit event recorded

Key Insight

Feature Flag System 不是 runtime if/else, 而是 safe configuration、rollout、experimentation 和 incident-control platform。


🧠 Staff-Level Answer(最终版)


👉 面试回答(完整背诵版)

在设计 Feature Flag System 时, 我会把它看作一个 runtime control plane, 用来控制产品和工程行为。

系统允许团队 enable 或 disable features, target 特定 users 或 tenants, gradual rollout changes, 运行 experiments, 并在 incidents 中快速关闭高风险行为。

我会将 control plane 和 data plane 分离。

Control plane 包括 admin UI、management APIs、 configuration store、approval workflow 和 audit logs。

Data plane 是运行在 applications 中的 SDK 和 local evaluation engine。

Runtime evaluation 通常应该本地完成, 使用 cached flag configuration, 因为每个请求调用远程 flag service 会增加 latency 并引入 availability risk。

Evaluation engine 接收 request context, 例如 user ID、tenant ID、region、plan、 app version 和 device, 然后评估 targeting rules、percentage rollout 和 default value,最终返回 variation。

Percentage rollout 应该使用 deterministic hashing, 这样用户会被稳定分配到固定 bucket。

对 experiments, 系统必须提供 stable assignment, 并记录 exposure events, 这样 metrics 才能正确分析。

Safety 非常关键。 Production flag changes 必须 audit, 高风险 changes 可能需要 approval, kill switches 应该快速传播。

如果 flag service 不可用, SDKs 应该使用 last-known-good config 和 safe defaults。

核心权衡包括 consistency、latency、safety、 operational complexity 和 governance。

最终目标是让 teams 可以安全发布、 逐步 rollout、可靠实验, 并在 production issue 出现时快速恢复。


⭐ Final Insight

Feature Flag System 的核心不是简单的 if/else, 而是一个支持安全发布、灰度 rollout、实验分析、快速回滚 和生产治理的 runtime control plane。
