
System Design Deep Dive - 03 Token-based Billing Systems Design

Posted by ailswan on May 24, 2026


🎯 Token-based Billing Systems Design


1️⃣ Core Framework

When discussing Token-based Billing Systems, I frame it as:

  1. Why token-based billing exists
  2. Token metering
  3. Request lifecycle accounting
  4. Pricing and rating engine
  5. Usage aggregation
  6. Quota and budget enforcement
  7. Billing correctness and idempotency
  8. Trade-offs: accuracy vs latency vs cost

2️⃣ Why Token-based Billing Exists

LLM cost is closely tied to token usage.

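A request usually consumes:

  • Input tokens (the prompt and any context)
  • Output tokens (the generated response)
  • Cached tokens (previously processed input that is reused)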

Basic Billing Flow

User Request
→ Count Input Tokens
→ Run Inference
→ Count Output Tokens
→ Calculate Cost
→ Store Usage Record
→ Aggregate Bill

👉 Interview Answer

Token-based billing is used because LLM infrastructure cost is strongly related to token usage.

The system meters input tokens, output tokens, model type, cached tokens, and other usage dimensions, then converts usage into cost using a pricing engine.


3️⃣ What Is a Token?


Token Definition

A token is a unit of text processed by the model.

It can be:

  • A whole word
  • Part of a word (a subword)
  • A punctuation mark or whitespace

Example

"Hello world"
→ ["Hello", " world"]

Why It Matters

Token count affects:

  • Compute
  • Latency
  • Memory
  • Cost

👉 Interview Answer

A token is the unit of text that the model processes.

Billing is based on tokens because token count directly affects compute, latency, memory, and cost.
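
To make the "Hello world" example concrete, here is a toy tokenizer sketch. It is an assumption for illustration only: production models use learned subword vocabularies, so real token counts will differ, and the name `toy_tokenize` is hypothetical.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into word-like pieces, keeping a leading space with each word.

    This only mimics the shape of tokenizer output. It is not a real model
    tokenizer and must not be used for billing.
    """
    # Either an optional leading space followed by word characters,
    # or a single punctuation character.
    return re.findall(r" ?\w+|[^\w\s]", text)


tokens = toy_tokenize("Hello world")
print(tokens, len(tokens))
```

In a real system the token meter should use the same tokenizer as the model being billed, otherwise metered counts and actual compute drift apart.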


4️⃣ High-Level Architecture


Architecture

Client
→ API Gateway
→ Auth / Quota Check
→ Request Validator
→ Token Meter
→ Inference Service
→ Usage Event Producer
→ Usage Event Stream
→ Aggregation Service
→ Rating Engine
→ Invoice / Billing System

Core Components

Token Meter

Counts input and output tokens.


Usage Event Producer

Emits usage events after request completion.


Aggregation Service

Aggregates usage by user, organization, model, and time window.


Rating Engine

Converts usage into cost.


Billing System

Generates invoices or charges.


👉 Interview Answer

A token billing system usually includes token metering, usage event generation, streaming ingestion, aggregation, rating, quota enforcement, and invoice generation.

The key requirement is accurate, reliable, and auditable usage accounting.


5️⃣ Token Metering


What to Meter

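The system should meter:

  • Input tokens
  • Output tokens
  • Cached tokens
  • Embedding tokens
  • Model type
  • Organization
  • Timestamp
  • Other pricing-relevant dimensions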

Example Usage Record

{
  "request_id": "req_123",
  "organization_id": "org_456",
  "model": "large-model",
  "input_tokens": 1200,
  "output_tokens": 300,
  "cached_tokens": 600,
  "timestamp": "2026-05-24T10:00:00Z"
}

👉 Interview Answer

Token metering records how many tokens were consumed by each request.

A good usage record should include request ID, organization, model, input tokens, output tokens, cached tokens, timestamp, and pricing-relevant dimensions.
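
The usage record above can be modeled as an immutable, validated type. This is a minimal sketch: the class name `UsageRecord` and the rule that cached tokens are a subset of input tokens are assumptions, not something the record format itself guarantees.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UsageRecord:
    request_id: str
    organization_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    timestamp: str  # ISO 8601, UTC

    def __post_init__(self) -> None:
        # Token counts can never be negative.
        for name in ("input_tokens", "output_tokens", "cached_tokens"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        # Assumption: cached tokens are a subset of the input tokens.
        if self.cached_tokens > self.input_tokens:
            raise ValueError("cached_tokens cannot exceed input_tokens")

    def to_json(self) -> str:
        """Serialize the record for the usage event stream."""
        return json.dumps(asdict(self), sort_keys=True)


record = UsageRecord(
    request_id="req_123", organization_id="org_456", model="large-model",
    input_tokens=1200, output_tokens=300, cached_tokens=600,
    timestamp="2026-05-24T10:00:00Z",
)
print(record.to_json())
```

Freezing the record keeps raw usage from being changed after it is metered, which matters later for auditability.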


6️⃣ Request Lifecycle Accounting


Request Lifecycle

Request accepted
→ Input tokens counted
→ Inference starts
→ Output tokens generated
→ Final token count captured
→ Usage event emitted

Important Cases

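The system must handle:

  • Successful requests
  • Failed requests
  • Cancelled requests
  • Timed-out requests
  • Retried requests
  • Partially streamed requests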

Example

Client disconnects after 100 output tokens
→ Bill only generated tokens
→ Emit partial usage event

👉 Interview Answer

Billing must follow the full request lifecycle.

The system should correctly account for successful, failed, cancelled, timed-out, retried, and partially streamed requests.
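
One way to keep lifecycle accounting consistent is to build the usage event in a single place, whatever the terminal state. The sketch below assumes a policy that I am making up for illustration: requests rejected before inference are free, and every other outcome bills the processed input plus the tokens actually generated. The names `Outcome` and `build_usage_event` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMED_OUT = "timed_out"


def build_usage_event(request_id: str, input_tokens: int, generated_tokens: int,
                      outcome: Outcome, inference_started: bool = True) -> dict:
    """Build the single usage event for a request that reached a terminal state."""
    if not inference_started:
        # Assumed policy: a request rejected before inference costs nothing.
        billable_input, billable_output = 0, 0
    else:
        # The input was processed, and only tokens actually generated count.
        billable_input, billable_output = input_tokens, generated_tokens
    return {
        "request_id": request_id,
        "input_tokens": billable_input,
        "output_tokens": billable_output,
        "outcome": outcome.value,
        # Anything other than a clean finish is flagged as partial usage.
        "partial": outcome is not Outcome.SUCCEEDED,
    }


# Client disconnects after 100 output tokens.
event = build_usage_event("req_123", 1200, 100, Outcome.CANCELLED)
print(event)
```

Whether failed or timed-out requests are billable is a product decision; the point is that the decision lives in one function instead of being scattered across handlers.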


7️⃣ Pricing Model


Pricing Dimensions

Pricing may depend on:

  • Model
  • Token type (input, output, cached)
  • Batch mode
  • Region
  • Customer tier

Example Pricing

Input tokens: $X per million
Output tokens: $Y per million
Cached tokens: discounted

Why Output Tokens May Cost More

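Input tokens can be processed in parallel in a single pass, while output tokens are decoded sequentially, one token per step. That is why generating an output token is often more expensive than processing an input token.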


👉 Interview Answer

The pricing model converts measured usage into cost.

Pricing may differ by model, input tokens, output tokens, cached tokens, batch mode, region, and customer tier.
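
A pricing model like this can be sketched as a lookup table keyed by the pricing dimensions. All prices, tier names, and multipliers below are hypothetical placeholders, and only three of the dimensions (model, token type, customer tier) are modeled.

```python
from decimal import Decimal

# Hypothetical prices in USD per one million tokens.
PRICE_PER_MILLION = {
    ("large-model", "input"): Decimal("1.00"),
    ("large-model", "output"): Decimal("3.00"),
    ("large-model", "cached"): Decimal("0.25"),
}

# Hypothetical multipliers per customer tier.
TIER_MULTIPLIER = {
    "standard": Decimal("1.0"),
    "enterprise": Decimal("0.8"),
}


def unit_price(model: str, token_type: str, tier: str = "standard") -> Decimal:
    """Return the price of one token for a model, token type, and tier."""
    per_million = PRICE_PER_MILLION[(model, token_type)]
    return per_million * TIER_MULTIPLIER[tier] / Decimal(1_000_000)


print(unit_price("large-model", "output"))
print(unit_price("large-model", "output", tier="enterprise"))
```

`Decimal` is used instead of `float` so that very small per-token prices do not pick up binary rounding error.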


8️⃣ Rating Engine


What Is Rating?

Rating converts usage into billable cost.

Usage Event
→ Pricing Rules
→ Cost

Rating Example

{
  "input_tokens": 1000,
  "output_tokens": 500,
  "input_price": 0.000001,
  "output_price": 0.000003,
  "total_cost": 0.0025
}

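Rating Engine Responsibilities

  • Calculate cost from usage and pricing rules
  • Apply discounts, credits, and free tier rules
  • Handle currency and billing-period logic
  • Store auditable cost records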

👉 Interview Answer

The rating engine applies pricing rules to token usage.

It calculates cost, applies discounts, credits, free tier rules, and stores auditable cost records.
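
The rating example above works out as 1000 × 0.000001 + 500 × 0.000003 = 0.001 + 0.0015 = 0.0025. Here is a minimal sketch of that calculation; the function name `rate` and the `pricing_version` field are assumptions, and discounts, credits, and free tier rules are left out.

```python
from decimal import Decimal


def rate(usage: dict, prices: dict, pricing_version: str) -> dict:
    """Apply per-token prices to one usage event and return a cost record."""
    input_cost = usage["input_tokens"] * prices["input_price"]
    output_cost = usage["output_tokens"] * prices["output_price"]
    return {
        "request_id": usage["request_id"],
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
        # Recording the rule version keeps the charge reproducible later.
        "pricing_version": pricing_version,
    }


prices = {"input_price": Decimal("0.000001"), "output_price": Decimal("0.000003")}
usage = {"request_id": "req_123", "input_tokens": 1000, "output_tokens": 500}
rated = rate(usage, prices, pricing_version="v1")
print(rated["total_cost"])
```

Because rating is a pure function of usage and a versioned price table, the same event can be re-rated after a pricing correction without touching the metered data.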


9️⃣ Usage Event Pipeline


Why Event Pipeline Is Needed

Billing data must be durable and reliable.


Event Flow

Inference Service
→ Usage Event
→ Durable Queue / Stream
→ Usage Processor
→ Aggregation Store
→ Billing System

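Requirements

  • Durable storage (no reliance on in-memory state)
  • Retries
  • Replay
  • Support for aggregation and auditing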

👉 Interview Answer

Usage events should flow through a durable event pipeline.

This allows reliable processing, retries, replay, aggregation, and auditing.

Billing events should never rely only on in-memory state.
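
To illustrate what "durable and replayable" means, here is a small sketch that uses an append-only JSON Lines file as a stand-in for a real queue or stream. The class name `UsageEventLog` is hypothetical, and a local file is an assumption chosen only to keep the example self-contained.

```python
import json
import os
import tempfile


class UsageEventLog:
    """Append-only event log; a stand-in for a durable queue or stream."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, event: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            # Force the write to disk before acknowledging the event.
            os.fsync(f.fileno())

    def replay(self):
        """Yield every stored event in the order it was appended."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    log = UsageEventLog(os.path.join(tmp, "usage.jsonl"))
    log.append({"request_id": "req_1", "output_tokens": 300})
    log.append({"request_id": "req_2", "output_tokens": 120})
    # A fresh consumer rebuilds its state by replaying from the start.
    replayed = list(UsageEventLog(log.path).replay())

print(len(replayed))
```

Replay is what lets a crashed or buggy usage processor be fixed and rerun without losing billable usage.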


🔟 Idempotency and Deduplication


Why It Matters

Requests may be retried.

Events may be delivered more than once.

Without deduplication, customers may be double-billed.


Idempotency Key

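request_id + usage_event_type + attempt_id

Here attempt_id must identify the inference attempt, not the delivery attempt. A redelivered event has to carry the same key as the original, or the duplicate will not be detected.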

Deduplication Flow

Usage event received
→ Check idempotency key
→ If already processed, skip
→ Else process and store

👉 Interview Answer

Idempotency is critical in billing systems.

Since requests and events may be retried, the system must deduplicate usage events using stable request IDs or idempotency keys to avoid double charging.
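
The deduplication flow can be sketched as follows. This is a minimal in-memory version: the class name `UsageProcessor` is hypothetical, and the set of seen keys stands in for what would need to be a durable store with a uniqueness guarantee.

```python
class UsageProcessor:
    """Applies each usage event at most once, keyed by its idempotency key."""

    def __init__(self) -> None:
        # Assumption: in production this is a durable store, not a set.
        self._seen: set[tuple] = set()
        self.total_tokens = 0

    @staticmethod
    def idempotency_key(event: dict) -> tuple:
        return (event["request_id"], event["usage_event_type"], event["attempt_id"])

    def process(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        key = self.idempotency_key(event)
        if key in self._seen:
            return False  # already processed, skip
        self.total_tokens += event["input_tokens"] + event["output_tokens"]
        self._seen.add(key)
        return True


processor = UsageProcessor()
event = {
    "request_id": "req_123", "usage_event_type": "final", "attempt_id": 1,
    "input_tokens": 1200, "output_tokens": 300,
}
first = processor.process(event)
second = processor.process(event)  # the same event delivered again
print(first, second, processor.total_tokens)
```

In a real system, recording the key and applying the usage should happen in one transaction, otherwise a crash between the two steps can still lose or double-count an event.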


1️⃣1️⃣ Quota and Budget Enforcement


Why Quotas Matter

Users and organizations need spending control.


Common Controls


Flow

Request arrives
→ Estimate token usage
→ Check quota / budget
→ Allow or reject

Challenge

Output tokens are unknown before generation.


👉 Interview Answer

Quota enforcement protects both the platform and customers.

The system should check estimated usage before inference and reconcile actual usage after completion.

Output token cost is harder because it is only known after generation.
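
One common way to handle unknown output size is to reserve a worst-case estimate before inference and settle the actual usage afterwards. The sketch below assumes a budget expressed in tokens and a caller-supplied `max_output_tokens` cap; the class name `Budget` is hypothetical and the code is not thread-safe.

```python
class Budget:
    """Token budget with reserve-then-settle accounting."""

    def __init__(self, limit_tokens: int) -> None:
        self.limit = limit_tokens
        self.used = 0
        self.reserved: dict[str, int] = {}  # request_id -> reserved tokens

    def available(self) -> int:
        return self.limit - self.used - sum(self.reserved.values())

    def reserve(self, request_id: str, input_tokens: int, max_output_tokens: int) -> bool:
        """Hold the worst-case usage; reject if it does not fit."""
        estimate = input_tokens + max_output_tokens
        if estimate > self.available():
            return False
        self.reserved[request_id] = estimate
        return True

    def settle(self, request_id: str, actual_tokens: int) -> None:
        """Release the reservation and record what was really consumed."""
        self.reserved.pop(request_id)
        self.used += actual_tokens


budget = Budget(limit_tokens=2000)
accepted = budget.reserve("req_1", input_tokens=1200, max_output_tokens=500)
rejected = budget.reserve("req_2", input_tokens=200, max_output_tokens=500)
budget.settle("req_1", actual_tokens=1500)  # only 300 output tokens were generated
print(accepted, rejected, budget.available())
```

Reserving the maximum is conservative: it can reject requests that would have fit, which is the accuracy versus latency trade-off named in the core framework.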


1️⃣2️⃣ Prepaid Credits vs Postpaid Billing


Prepaid

Users buy credits first.

Credit balance
→ Deduct usage
→ Reject requests when balance is exhausted

Advantages

  • Lower financial risk for the platform
  • Hard spending limits

Postpaid

Users are billed after usage.

Usage accumulated
→ Monthly invoice

Advantages

  • Fits enterprise customers who expect invoices and contracts
  • Supports volume discounts

👉 Interview Answer

Token billing systems can support prepaid or postpaid models.

Prepaid credits reduce financial risk and enforce hard limits.

Postpaid billing is common for enterprise customers with invoices, contracts, and volume discounts.


1️⃣3️⃣ Streaming Billing


Why Streaming Is Special

Output tokens are generated gradually.

The system may stream tokens before knowing final usage.


Streaming Billing Flow

Start request
→ Count input tokens
→ Stream output tokens
→ Count generated tokens
→ Emit final usage event

Cancellation Case

Client cancels after 200 tokens
→ Bill 200 generated tokens

👉 Interview Answer

Streaming responses require billing based on actual generated tokens.

The system should count output tokens as they are produced and emit a final usage event when the stream completes, fails, or is cancelled.
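
In Python, the "emit a final usage event on completion, failure, or cancellation" rule maps naturally onto a generator with a `finally` block. This is a sketch under assumptions: `metered_stream` and the `emit` callback are hypothetical names, and `token_stream` stands in for the model's output iterator.

```python
from typing import Callable, Iterable, Iterator


def metered_stream(request_id: str, input_tokens: int,
                   token_stream: Iterable[str],
                   emit: Callable[[dict], None]) -> Iterator[str]:
    """Pass tokens through to the client while counting them."""
    generated = 0
    status = "completed"
    try:
        for token in token_stream:
            generated += 1  # count the token as soon as it is produced
            yield token
    except GeneratorExit:
        # The consumer closed the stream early, e.g. a client disconnect.
        status = "cancelled"
        raise
    except Exception:
        status = "failed"
        raise
    finally:
        # Runs exactly once for every stream that was started.
        emit({
            "request_id": request_id,
            "input_tokens": input_tokens,
            "output_tokens": generated,
            "status": status,
        })


events: list[dict] = []
stream = metered_stream("req_1", 10, iter(["a", "b", "c", "d"]), events.append)
received = [next(stream), next(stream)]
stream.close()  # simulate the client cancelling after two tokens
print(events)
```

The event reports two output tokens, not four, because only tokens that were actually pulled from the model are counted.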


1️⃣4️⃣ Aggregation


Why Aggregate Usage?

Raw events are too detailed for invoices.

Aggregation summarizes usage by:

  • Organization
  • Model
  • Token type
  • Time window
  • API key
  • Billing period

Example

org_123
model = large-model
date = 2026-05-24
input_tokens = 10M
output_tokens = 2M
cost = $X

👉 Interview Answer

Usage aggregation converts raw request-level usage events into billing summaries.

Aggregation usually groups by organization, model, token type, time window, API key, and billing period.
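
A daily rollup by organization and model can be sketched in a few lines. The function name `aggregate_daily` is hypothetical, and the code assumes every timestamp is an ISO 8601 UTC string, so the first ten characters are the date.

```python
from collections import defaultdict


def aggregate_daily(events: list[dict]) -> dict:
    """Sum token counts per (organization, model, UTC date)."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for e in events:
        # "2026-05-24T10:00:00Z"[:10] -> "2026-05-24"
        key = (e["organization_id"], e["model"], e["timestamp"][:10])
        totals[key]["input_tokens"] += e["input_tokens"]
        totals[key]["output_tokens"] += e["output_tokens"]
    return dict(totals)


events = [
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-24T10:00:00Z", "input_tokens": 1200, "output_tokens": 300},
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-24T18:30:00Z", "input_tokens": 800, "output_tokens": 200},
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-25T09:00:00Z", "input_tokens": 500, "output_tokens": 100},
]
summary = aggregate_daily(events)
print(summary[("org_123", "large-model", "2026-05-24")])
```

Aggregating only deduplicated events matters here: a duplicate that slips through is silently folded into the total and is hard to spot on an invoice.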


1️⃣5️⃣ Auditability


Why Audit Matters

Customers may dispute bills.

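The system must prove:

  • Which request produced the charge
  • How many tokens were counted
  • Which model and pricing rule version were applied
  • Which discounts were applied
  • When and how the usage was processed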

Audit Record

Request ID
Usage event
Token counts
Pricing rule version
Calculated cost
Processing timestamp

👉 Interview Answer

Billing systems must be auditable.

For each charge, the system should be able to trace the request, token counts, model, pricing rule, discounts, and processing history.
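
Storing the pricing rule version makes a charge reproducible: the cost can be recomputed from the audit record and compared with what was billed. A minimal sketch, assuming a hypothetical versioned rule table `PRICING_RULES` and flat per-token prices with no discounts:

```python
from decimal import Decimal

# Hypothetical versioned pricing rules, in USD per token.
PRICING_RULES = {
    "v1": {"input": Decimal("0.000001"), "output": Decimal("0.000003")},
    "v2": {"input": Decimal("0.000002"), "output": Decimal("0.000006")},
}


def verify_charge(audit_record: dict) -> bool:
    """Recompute the cost from the audit record and compare it to the charge."""
    rule = PRICING_RULES[audit_record["pricing_rule_version"]]
    expected = (audit_record["input_tokens"] * rule["input"]
                + audit_record["output_tokens"] * rule["output"])
    return expected == Decimal(audit_record["calculated_cost"])


audit_record = {
    "request_id": "req_123",
    "input_tokens": 1000,
    "output_tokens": 500,
    "pricing_rule_version": "v1",
    "calculated_cost": "0.0025",
    "processing_timestamp": "2026-05-24T10:00:05Z",
}
print(verify_charge(audit_record))
```

Without the version field, a later price change would make old charges impossible to reproduce.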


1️⃣6️⃣ Best Practices


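Practical Rules

  • Separate metering, rating, aggregation, and invoicing
  • Use durable events
  • Make event processing idempotent
  • Keep audit logs
  • Enforce quotas
  • Version pricing rules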

Design Principle

Meter first.
Rate later.
Bill with auditability.

👉 Interview Answer

A good token-based billing system separates metering, rating, aggregation, and invoicing.

It should use durable events, idempotency, audit logs, quota enforcement, and pricing-rule versioning.


🧠 Final Staff-Level Answer


👉 Interview Answer (Full Version)

Token-based billing is used for LLM systems because infrastructure cost is strongly related to token usage.

A production billing system needs to meter input tokens, output tokens, cached tokens, embedding tokens, model type, organization, timestamp, and other pricing dimensions.

The high-level architecture usually includes an API gateway, quota checker, token meter, inference service, usage event producer, durable event stream, usage processor, aggregation store, rating engine, and billing or invoicing system.

During the request lifecycle, the system counts input tokens before inference, tracks output tokens during generation, and emits a usage event when the request completes, fails, times out, or is cancelled.

Streaming responses require special care because output tokens are produced gradually.

The system should bill based on the actual number of generated tokens, even if the client disconnects early.

The usage event pipeline must be durable and replayable.

Billing should not depend only on in-memory state.

Since retries and duplicate events can happen, idempotency and deduplication are critical to avoid double billing.

I would separate metering from rating.

Metering records raw usage.

Rating applies pricing rules, discounts, credits, free tier rules, currency handling, and billing-period logic.

Quota and budget enforcement should happen before inference using estimated usage, then actual usage should be reconciled after completion.

Because output tokens are unknown in advance, the system may reserve budget upfront and settle the final amount later.

Finally, auditability is essential.

Every charge should be traceable back to request ID, token counts, model, pricing rule version, discounts, and processing timestamps.

The core principle is: meter first, rate later, and bill with auditability.


⭐ Final Insight

The core of token-based billing is not simply:

"Count the tokens, then multiply by the price."

A real production billing system is:

  • Token Metering
  • Usage Events
  • Durable Stream
  • Idempotency
  • Aggregation
  • Rating Engine
  • Quota Enforcement
  • Audit Logs
  • Invoice Generation

The most important principle is:

Meter first.

Rate later.

Bill with auditability.


