🎯 Token-based Billing Systems Design
1️⃣ Core Framework
When discussing Token-based Billing Systems, I frame it as:
- Why token-based billing exists
- Token metering
- Request lifecycle accounting
- Pricing and rating engine
- Usage aggregation
- Quota and budget enforcement
- Billing correctness and idempotency
- Trade-offs: accuracy vs latency vs cost
2️⃣ Why Token-based Billing Exists
LLM cost is closely tied to token usage.
A request may consume several kinds of tokens, and each kind can be priced differently:
- Input tokens
- Output tokens
- Cached tokens
- Tool-call tokens
- Embedding tokens
- Reasoning tokens
- Multimodal tokens
Basic Billing Flow
User Request
→ Count Tokens
→ Run Inference
→ Count Output Tokens
→ Calculate Cost
→ Store Usage Record
→ Aggregate Bill
👉 Interview Answer
Token-based billing is used because LLM infrastructure cost is strongly related to token usage.
The system meters input tokens, output tokens, model type, cached tokens, and other usage dimensions, then converts usage into cost using a pricing engine.
3️⃣ What Is a Token?
Token Definition
A token is a unit of text processed by the model.
It can be:
- A word
- Part of a word
- Punctuation
- Whitespace
- Special token
Example (the exact split depends on the tokenizer)
"Hello world"
→ ["Hello", " world"]
Why It Matters
Token count affects:
- Latency
- Cost
- Context window usage
- GPU memory
- Inference time
👉 Interview Answer
A token is the unit of text that the model processes.
Billing is based on tokens because token count directly affects compute, latency, memory, and cost.
4️⃣ High-Level Architecture
Architecture
Client
→ API Gateway
→ Auth / Quota Check
→ Request Validator
→ Token Meter
→ Inference Service
→ Usage Event Producer
→ Usage Event Stream
→ Aggregation Service
→ Rating Engine
→ Invoice / Billing System
Core Components
Token Meter
Counts input and output tokens.
Usage Event Producer
Emits usage events after request completion.
Aggregation Service
Aggregates usage by user, organization, model, and time window.
Rating Engine
Converts usage into cost.
Billing System
Generates invoices or charges.
👉 Interview Answer
A token billing system usually includes token metering, usage event generation, streaming ingestion, aggregation, rating, quota enforcement, and invoice generation.
The key requirement is accurate, reliable, and auditable usage accounting.
5️⃣ Token Metering
What to Meter
The system should meter:
- Prompt input tokens
- Completion output tokens
- Cached input tokens
- Embedding tokens
- Tool-call tokens
- Model name
- Request ID
- Organization ID
- Timestamp
Example Usage Record
{
"request_id": "req_123",
"organization_id": "org_456",
"model": "large-model",
"input_tokens": 1200,
"output_tokens": 300,
"cached_tokens": 600,
"timestamp": "2026-05-24T10:00:00Z"
}
👉 Interview Answer
Token metering records how many tokens were consumed by each request.
A good usage record should include request ID, organization, model, input tokens, output tokens, cached tokens, timestamp, and pricing-relevant dimensions.
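The usage record above can be sketched as a small immutable type. This is a minimal illustration, not a reference schema: the class name `UsageRecord`, the validation rules, and the `to_json` helper are assumptions, while the field names come from the example record.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical immutable usage record; fields mirror the example above."""
    request_id: str
    organization_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    timestamp: datetime

    def __post_init__(self) -> None:
        # Assumed validation: token counts can never be negative.
        for name in ("input_tokens", "output_tokens", "cached_tokens"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        # Assumed validation: naive timestamps are ambiguous for billing.
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")

    def to_json(self) -> str:
        payload = asdict(self)
        utc = self.timestamp.astimezone(timezone.utc)
        payload["timestamp"] = utc.isoformat().replace("+00:00", "Z")
        return json.dumps(payload, sort_keys=True)


record = UsageRecord(
    request_id="req_123",
    organization_id="org_456",
    model="large-model",
    input_tokens=1200,
    output_tokens=300,
    cached_tokens=600,
    timestamp=datetime(2026, 5, 24, 10, 0, tzinfo=timezone.utc),
)
print(record.to_json())
```

Freezing the record is a design choice: once metered, raw usage should not be mutated, and corrections are better modelled as new events.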
6️⃣ Request Lifecycle Accounting
Request Lifecycle
Request accepted
→ Input tokens counted
→ Inference starts
→ Output tokens generated
→ Final token count captured
→ Usage event emitted
Important Cases
The system must handle:
- Successful requests
- Failed requests
- Cancelled requests
- Streamed responses
- Partial outputs
- Retries
- Timeouts
Example
Client disconnects after 100 output tokens
→ Bill only the 100 output tokens actually generated
→ Emit partial usage event
👉 Interview Answer
Billing must follow the full request lifecycle.
The system should correctly account for successful, failed, cancelled, timed-out, retried, and partially streamed requests.
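One way to keep lifecycle accounting honest is to track each request in a small object that emits exactly one usage event at a terminal state. The sketch below is illustrative only: `RequestAccount`, the outcome names, and the `partial` flag are assumptions, not part of any specific platform.

```python
from dataclasses import dataclass

# Assumed set of terminal states, taken from the cases listed above.
TERMINAL_OUTCOMES = {"succeeded", "failed", "cancelled", "timed_out"}


@dataclass
class RequestAccount:
    """Hypothetical per-request accounting state."""
    request_id: str
    input_tokens: int = 0
    output_tokens: int = 0
    finished: bool = False

    def add_output(self, count: int) -> None:
        if self.finished:
            raise RuntimeError("request already finalized")
        self.output_tokens += count

    def finish(self, outcome: str) -> dict:
        if outcome not in TERMINAL_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        if self.finished:
            # Guard against emitting two usage events for one request.
            raise RuntimeError("usage event already emitted")
        self.finished = True
        return {
            "request_id": self.request_id,
            "outcome": outcome,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "partial": outcome != "succeeded",
        }


# Client disconnects after 100 output tokens.
account = RequestAccount(request_id="req_123", input_tokens=1200)
account.add_output(100)
event = account.finish("cancelled")
print(event)
```

Whether failed or timed-out requests are actually charged is a pricing policy decision; the event only records what happened.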
7️⃣ Pricing Model
Pricing Dimensions
Pricing may depend on:
- Model
- Input tokens
- Output tokens
- Cached tokens
- Embedding tokens
- Batch vs real-time mode
- Region
- Customer tier
- Volume discounts
Example Pricing
Input tokens: $X per million
Output tokens: $Y per million
Cached tokens: discounted
Why Output Tokens May Cost More
Input tokens can be processed in parallel in a single prefill pass, while each output token needs its own sequential decoding step, so output tokens are often more expensive to serve.
👉 Interview Answer
The pricing model converts measured usage into cost.
Pricing may differ by model, input tokens, output tokens, cached tokens, batch mode, region, and customer tier.
8️⃣ Rating Engine
What Is Rating?
Rating converts usage into billable cost.
Usage Event
→ Pricing Rules
→ Cost
Rating Example
{
"input_tokens": 1000,
"output_tokens": 500,
"input_price_per_token": 0.000001,
"output_price_per_token": 0.000003,
"total_cost": 0.0025
}
Rating Engine Responsibilities
- Apply pricing rules
- Apply discounts
- Apply free tier
- Apply credits
- Handle currency
- Handle billing periods
- Preserve audit trail
👉 Interview Answer
The rating engine applies pricing rules to token usage.
It calculates cost, applies discounts, credits, and free-tier rules, and stores auditable cost records.
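The rating example can be reproduced with exact decimal arithmetic. This is a minimal sketch: the function name `rate` and the per-million price parameters are assumptions, and the prices are the example's per-token values expressed per million tokens (1.00 and 3.00). Discounts, credits, and currency handling are left out.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def rate(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: Decimal,
    output_price_per_million: Decimal,
) -> Decimal:
    """Convert token usage into cost using exact decimal arithmetic."""
    input_cost = Decimal(input_tokens) * input_price_per_million / MILLION
    output_cost = Decimal(output_tokens) * output_price_per_million / MILLION
    return input_cost + output_cost


# 1000 input tokens and 500 output tokens, as in the rating example.
cost = rate(1000, 500, Decimal("1.00"), Decimal("3.00"))
print(cost)
```

`Decimal` is used instead of `float` so that per-token prices, which are very small numbers, do not accumulate binary rounding error across millions of events.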
9️⃣ Usage Event Pipeline
Why Event Pipeline Is Needed
Billing data must be durable and reliable.
Event Flow
Inference Service
→ Usage Event
→ Durable Queue / Stream
→ Usage Processor
→ Aggregation Store
→ Billing System
Requirements
- Durable delivery
- Idempotent processing
- Ordering where needed
- Replay support
- Dead-letter queue
- Audit logs
👉 Interview Answer
Usage events should flow through a durable event pipeline.
This allows reliable processing, retries, replay, aggregation, and auditing.
Billing events should never rely only on in-memory state.
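The retry and dead-letter behaviour can be sketched as follows. The in-memory `deque` here is only a stand-in for a durable queue or stream, so the sketch shows the control flow, not the durability. `process_stream`, `handler`, and `max_attempts` are hypothetical names.

```python
from collections import deque
from typing import Callable


def process_stream(
    events: list[dict],
    handler: Callable[[dict], None],
    max_attempts: int = 3,
) -> list[dict]:
    """Process events with bounded retries; return the dead-lettered ones."""
    queue = deque((event, 1) for event in events)
    dead_letters = []
    while queue:
        event, attempt = queue.popleft()
        try:
            handler(event)
        except Exception as exc:
            if attempt >= max_attempts:
                # Park the event for inspection instead of dropping it.
                dead_letters.append(
                    {"event": event, "error": str(exc), "attempts": attempt}
                )
            else:
                queue.append((event, attempt + 1))
    return dead_letters


processed = []


def handler(event: dict) -> None:
    if "organization_id" not in event:
        raise ValueError("missing organization_id")
    processed.append(event["request_id"])


events = [
    {"request_id": "req_1", "organization_id": "org_456"},
    {"request_id": "req_2"},  # malformed: cannot be attributed to a customer
    {"request_id": "req_3", "organization_id": "org_456"},
]
dead = process_stream(events, handler)
print(processed, dead)
```

Dead-lettered events stay available for replay once the underlying problem is fixed, which is why replay support and a dead-letter queue appear together in the requirements.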
🔟 Idempotency and Deduplication
Why It Matters
Requests may be retried.
Events may be delivered more than once.
Without deduplication, customers may be double-billed.
Idempotency Key
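request_id + usage_event_type + attempt_id
(attempt_id identifies the inference attempt and must stay the same when that attempt's event is redelivered; otherwise redeliveries produce new keys and are not deduplicated)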
Deduplication Flow
Usage event received
→ Check idempotency key
→ If already processed, skip
→ Else process and store
👉 Interview Answer
Idempotency is critical in billing systems.
Since requests and events may be retried, the system must deduplicate usage events using stable request IDs or idempotency keys to avoid double charging.
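The deduplication flow can be sketched as below. This is a simplified illustration: `UsageProcessor` is a hypothetical name, and the in-memory set stands in for a durable store. In a real system the key check and the write would typically need to be atomic, for example through a unique constraint on the key.

```python
class UsageProcessor:
    """Hypothetical processor that applies each usage event at most once."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str, str]] = set()
        self.tokens_by_org: dict[str, int] = {}

    def process(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        key = (event["request_id"], event["usage_event_type"], event["attempt_id"])
        if key in self._seen:
            return False  # already processed, skip
        self._seen.add(key)
        org = event["organization_id"]
        tokens = event["input_tokens"] + event["output_tokens"]
        self.tokens_by_org[org] = self.tokens_by_org.get(org, 0) + tokens
        return True


processor = UsageProcessor()
event = {
    "request_id": "req_123",
    "usage_event_type": "completion",
    "attempt_id": "attempt_1",
    "organization_id": "org_456",
    "input_tokens": 1200,
    "output_tokens": 300,
}
first = processor.process(event)
second = processor.process(event)  # redelivery of the same event
print(first, second, processor.tokens_by_org)
```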
1️⃣1️⃣ Quota and Budget Enforcement
Why Quotas Matter
Users and organizations need spending control.
Common Controls
- Token-per-minute limit
- Request-per-minute limit
- Daily budget
- Monthly budget
- Model-specific quota
- Organization-level quota
- Hard and soft limits
Flow
Request arrives
→ Estimate token usage
→ Check quota / budget
→ Allow or reject
Challenge
Output tokens are unknown before generation.
👉 Interview Answer
Quota enforcement protects both the platform and customers.
The system should check estimated usage before inference and reconcile actual usage after completion.
Output token cost is harder to enforce in advance because it is known only after generation.
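A common way to handle the unknown output size is to reserve a worst-case estimate before inference and settle the actual usage afterwards. The sketch below assumes a token-denominated budget and a caller-supplied `max_output_tokens`; `BudgetLedger` and its methods are hypothetical names.

```python
class BudgetLedger:
    """Hypothetical token budget with reserve-then-settle semantics."""

    def __init__(self, limit_tokens: int) -> None:
        self.limit = limit_tokens
        self.used = 0
        self.reserved: dict[str, int] = {}  # request_id -> reserved tokens

    def reserve(self, request_id: str, input_tokens: int, max_output_tokens: int) -> bool:
        """Hold the worst-case usage; reject if it would exceed the budget."""
        estimate = input_tokens + max_output_tokens
        committed = self.used + sum(self.reserved.values())
        if committed + estimate > self.limit:
            return False
        self.reserved[request_id] = estimate
        return True

    def settle(self, request_id: str, actual_tokens: int) -> None:
        """Release the hold and record what was actually consumed."""
        self.reserved.pop(request_id)
        self.used += actual_tokens


ledger = BudgetLedger(limit_tokens=1000)
allowed_a = ledger.reserve("req_a", input_tokens=300, max_output_tokens=500)
rejected_b = ledger.reserve("req_b", input_tokens=100, max_output_tokens=200)
ledger.settle("req_a", actual_tokens=400)  # only 100 output tokens were generated
retried_b = ledger.reserve("req_b", input_tokens=100, max_output_tokens=200)
print(allowed_a, rejected_b, retried_b, ledger.used)
```

Reserving the worst case is conservative: it can reject requests that would have fit, which is the accuracy versus latency trade-off in quota enforcement.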
1️⃣2️⃣ Prepaid Credits vs Postpaid Billing
Prepaid
Users buy credits first.
Credit balance
→ Deduct usage
→ Reject new requests when the balance is exhausted
Advantages
- Lower financial risk
- Easier for consumer products
- Hard spending limit
Postpaid
Users are billed after usage.
Usage accumulated
→ Monthly invoice
Advantages
- Better for enterprise customers
- Easier to support high-volume usage
- Supports contracts and discounts
👉 Interview Answer
Token billing systems can support prepaid or postpaid models.
Prepaid credits reduce financial risk and enforce hard limits.
Postpaid billing is common for enterprise customers with invoices, contracts, and volume discounts.
1️⃣3️⃣ Streaming Billing
Why Streaming Is Special
Output tokens are generated gradually.
The system may stream tokens before knowing final usage.
Streaming Billing Flow
Start request
→ Count input tokens
→ Stream output tokens
→ Count generated tokens
→ Emit final usage event
Cancellation Case
Client cancels after 200 tokens
→ Bill 200 generated tokens
👉 Interview Answer
Streaming responses require billing based on actual generated tokens.
The system should count output tokens as they are produced and emit a final usage event when the stream completes, fails, or is cancelled.
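The streaming flow can be sketched with a generator that counts tokens as they are yielded and emits one final usage event in a `finally` block. This is a minimal illustration: `metered_stream`, the `emit` callback, and the status names are assumptions.

```python
from typing import Callable, Iterable, Iterator


def metered_stream(
    request_id: str,
    input_tokens: int,
    tokens: Iterable[str],
    emit: Callable[[dict], None],
) -> Iterator[str]:
    """Yield tokens to the client and emit a final usage event on any exit."""
    generated = 0
    status = "completed"
    try:
        for token in tokens:
            generated += 1
            yield token
    except GeneratorExit:
        # The consumer closed the stream early (client cancel or disconnect).
        status = "cancelled"
        raise
    except Exception:
        status = "failed"
        raise
    finally:
        emit(
            {
                "request_id": request_id,
                "input_tokens": input_tokens,
                "output_tokens": generated,
                "status": status,
            }
        )


events: list[dict] = []
stream = metered_stream("req_1", 10, ["a", "b", "c", "d", "e"], events.append)
next(stream)
next(stream)
stream.close()  # client cancels after 2 tokens
print(events)
```

Because the event is emitted from `finally`, completion, failure, and cancellation all go through the same path, so a started stream does not end without a usage event.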
1️⃣4️⃣ Aggregation
Why Aggregate Usage?
Raw events are too detailed for invoices.
Aggregation summarizes usage by:
- Organization
- User
- API key
- Model
- Time window
- Token type
- Region
- Product feature
Example
org_123
model = large-model
date = 2026-05-24
input_tokens = 10M
output_tokens = 2M
cost = $X
👉 Interview Answer
Usage aggregation converts raw request-level usage events into billing summaries.
Aggregation usually groups by organization, model, token type, time window, API key, and billing period.
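Aggregation by organization, model, and day can be sketched as a simple group-by. The sketch assumes every event carries an ISO-8601 UTC timestamp, so the first ten characters are the billing date; the function name `aggregate` is hypothetical.

```python
from collections import defaultdict


def aggregate(events: list[dict]) -> dict[tuple[str, str, str], dict[str, int]]:
    """Sum token counts per (organization, model, UTC date)."""
    totals: dict = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for event in events:
        # Assumes ISO-8601 UTC timestamps such as "2026-05-24T10:00:00Z".
        day = event["timestamp"][:10]
        key = (event["organization_id"], event["model"], day)
        totals[key]["input_tokens"] += event["input_tokens"]
        totals[key]["output_tokens"] += event["output_tokens"]
    return dict(totals)


events = [
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-24T10:00:00Z", "input_tokens": 1200, "output_tokens": 300},
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-24T18:30:00Z", "input_tokens": 800, "output_tokens": 200},
    {"organization_id": "org_123", "model": "large-model",
     "timestamp": "2026-05-25T09:00:00Z", "input_tokens": 500, "output_tokens": 100},
]
summary = aggregate(events)
print(summary)
```

Summing raw token counts first and rating the totals later keeps aggregation independent of pricing changes.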
1️⃣5️⃣ Auditability
Why Audit Matters
Customers may dispute bills.
The system must prove:
- Which requests were billed
- How many tokens were counted
- Which model was used
- Which price applied
- Which discounts applied
- When events were processed
Audit Record
Request ID
Usage event
Token counts
Pricing rule version
Calculated cost
Processing timestamp
👉 Interview Answer
Billing systems must be auditable.
For each charge, the system should be able to trace the request, token counts, model, pricing rule, discounts, and processing history.
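An audit record becomes much easier to defend when it stores the pricing rule version that was in effect at request time. The sketch below is an assumption-laden illustration: `PricingRule`, `rule_for`, `audit_record`, and the version labels and prices are all hypothetical, and timestamps are assumed to be ISO-8601 UTC strings in one consistent format so they compare correctly as text.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PricingRule:
    version: str
    effective_from: str  # ISO-8601 UTC, same format as event timestamps
    input_price_per_million: Decimal
    output_price_per_million: Decimal


def rule_for(rules: list[PricingRule], timestamp: str) -> PricingRule:
    """Pick the latest rule that was already in effect at the given time."""
    applicable = [rule for rule in rules if rule.effective_from <= timestamp]
    if not applicable:
        raise LookupError(f"no pricing rule in effect at {timestamp}")
    return max(applicable, key=lambda rule: rule.effective_from)


def audit_record(event: dict, rules: list[PricingRule], processed_at: str) -> dict:
    rule = rule_for(rules, event["timestamp"])
    cost = (
        Decimal(event["input_tokens"]) * rule.input_price_per_million
        + Decimal(event["output_tokens"]) * rule.output_price_per_million
    ) / Decimal(1_000_000)
    return {
        "request_id": event["request_id"],
        "input_tokens": event["input_tokens"],
        "output_tokens": event["output_tokens"],
        "pricing_rule_version": rule.version,
        "calculated_cost": str(cost),
        "processed_at": processed_at,
    }


rules = [
    PricingRule("v1", "2026-01-01T00:00:00Z", Decimal("1.00"), Decimal("3.00")),
    PricingRule("v2", "2026-06-01T00:00:00Z", Decimal("2.00"), Decimal("6.00")),
]
event = {"request_id": "req_123", "timestamp": "2026-05-24T10:00:00Z",
         "input_tokens": 1200, "output_tokens": 300}
audit = audit_record(event, rules, processed_at="2026-05-24T10:00:05Z")
print(audit)
```

Selecting the rule by event time rather than processing time means a replay or backfill after a price change still produces the original charge.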
1️⃣6️⃣ Best Practices
Practical Rules
- Meter usage at request level
- Use durable usage events
- Make billing processing idempotent
- Separate metering from rating
- Track pricing rule versions
- Support replay and backfill
- Reconcile estimated vs actual usage
- Enforce quotas before inference
- Store audit records
- Monitor billing anomalies
Design Principle
Meter first.
Rate later.
Bill with auditability.
👉 Interview Answer
A good token-based billing system separates metering, rating, aggregation, and invoicing.
It should use durable events, idempotency, audit logs, quota enforcement, and pricing-rule versioning.
🧠 Final Staff-Level Answer
👉 Interview Answer (Full Version)
Token-based billing is used for LLM systems because infrastructure cost is strongly related to token usage.
A production billing system needs to meter input tokens, output tokens, cached tokens, embedding tokens, model type, organization, timestamp, and other pricing dimensions.
The high-level architecture usually includes an API gateway, quota checker, token meter, inference service, usage event producer, durable event stream, usage processor, aggregation store, rating engine, and billing or invoicing system.
During the request lifecycle, the system counts input tokens before inference, tracks output tokens during generation, and emits a usage event when the request completes, fails, times out, or is cancelled.
Streaming responses require special care because output tokens are produced gradually.
The system should bill based on the actual number of generated tokens, even if the client disconnects early.
The usage event pipeline must be durable and replayable.
Billing should not depend only on in-memory state.
Since retries and duplicate events can happen, idempotency and deduplication are critical to avoid double billing.
I would separate metering from rating.
Metering records raw usage.
Rating applies pricing rules, discounts, credits, free tier rules, currency handling, and billing-period logic.
Quota and budget enforcement should happen before inference using estimated usage, then actual usage should be reconciled after completion.
Because output tokens are unknown in advance, the system may reserve budget upfront and settle the final amount later.
Finally, auditability is essential.
Every charge should be traceable back to request ID, token counts, model, pricing rule version, discounts, and processing timestamps.
The core principle is: meter first, rate later, and bill with auditability.
⭐ Final Insight
The core of token-based billing is not simply:
"Count the tokens, then multiply by the price."
A real production billing system is:
- Token Metering
- Usage Events
- Durable Stream
- Idempotency
- Aggregation
- Rating Engine
- Quota Enforcement
- Audit Logs
- Invoice Generation
The most important principle is:
Meter first.
Rate later.
Bill with auditability.